Basic regular expressions

A syntax for describing patterns of text, i.e. Steroids for Grep

Regular expressions are patterns for describing text that we want to find. Their syntax represents a mini-language of their own (though not a complete programming language) that has to be memorized. But the additional complexity allows us to greatly expand the way we search and filter text.

At the most basic level, using a regular expression is no different than doing a Ctrl-F to activate the "Find" function in your word processor. I like to think of regular expressions as simply, "finding text, on steroids."

The easiest way to use regular expressions is through grep (check out the basic tutorial on grep if you haven't already) with its extended regular expression option, -E.

For the purpose of this guide, we'll use the following excerpt (referred to as excerpt.txt) from President Obama's Jan. 10, 2015, weekly address, "Resurgence is Real". I've reformatted it to put each sentence on its own line (this is due to a constraint of grep, which can only match patterns by line):

Hi, everybody.
About a year ago, I promised that 2014 would be a breakthrough year for America.
And this week, we got more evidence to back that up.
In December, our businesses created 240,000 new jobs.
The unemployment rate fell to 5.6%.
That means that 2014 was the strongest year for job growth since the 1990s.
In 2014, unemployment fell faster than it has in three decades.
Over a 58-month streak, our businesses have created 11.2 million new jobs.
After a decade of decline, American manufacturing is in its best stretch of job growth since the '90s.
America is now the world’s number one producer of oil and gas, helping to save drivers about a buck-ten a gallon at the pump over this time last year.
Thanks to the Affordable Care Act, about 10 million Americans have gained health insurance in the past year alone.
We have cut our deficits by about two-thirds.
And after 13 long years, our war in Afghanistan has come to a responsible end, and more of our brave troops have come home.

Literal strings

What sets a regular expression apart from your typical find-text function is its use of metacharacters to describe patterns to match, such as any numerical digit, any non-alphabetical character, or the end of a line.

But if we specify a regular expression without any metacharacters, then it simply acts as a straightforward text-finding function. The following invocation of grep and its extended regex option, -E, simply looks for 2014 in excerpt.txt – another way I like to phrase this is that grep is searching for the pattern that contains the literal string, "2014":

grep -E '2014' excerpt.txt

By default, grep outputs every line that contains 2014:

About a year ago, I promised that 2014 would be a breakthrough year for America.
That means that 2014 was the strongest year for job growth since the 1990s.
In 2014, unemployment fell faster than it has in three decades.

Let's use the grep option, -o, to show only the exact match:

grep -oE '2014' excerpt.txt

The output:

2014
2014
2014

Character classes

Matching literal strings is nice when you know exactly what you want. But many times, we have no idea, especially when searching thousands or millions of text files and strings.

So instead of searching for the literal string, 2014, what if we could search for any sequence of four numerical digits? For example, to see if Obama mentioned any other year in his speech?

The regular expression syntax for the character class of numerical digits is: [[:digit:]]. (Many other regex flavors use the shorthand \d for this, but grep -E doesn't reliably support it.)

To reiterate the first example with 2014, we were searching for the literal pattern of 2014. In the following example that uses [[:digit:]], we are not searching for the literal pattern of [[:digit:]]. Instead, that pattern, [[:digit:]], is the regular expression syntax for matching any single numerical digit.

So, to find four numerical digits in a row:

grep -oE "[[:digit:]][[:digit:]][[:digit:]][[:digit:]]" excerpt.txt
2014
2014
1990
2014

Here's a few of the character classes that work with grep (and other Unix tools, such as tr):

[[:alnum:]] - Alphanumeric characters, i.e. A to Z, 0 to 9
[[:alpha:]] - The English alphabet
[[:digit:]] - Numbers 0 to 9
[[:lower:]] - Lower-case letters
[[:punct:]] - Punctuation
[[:space:]] - Space characters, including tabs and newlines
[[:upper:]] - Upper-case letters
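
To get a feel for these classes, here's a minimal sketch that runs a couple of them against a made-up sample line. The sample text and the variable names are my own and aren't part of excerpt.txt. It also shows tr, which takes the same class names but with a single pair of brackets:

```shell
# Hypothetical sample line, made up for illustration
sample='Call 555-0199, Room B12!'

# Each -o match is printed on its own line
uppers=$(printf '%s\n' "$sample" | grep -oE '[[:upper:]]')
puncts=$(printf '%s\n' "$sample" | grep -oE '[[:punct:]]')

# tr uses the class names with single brackets: lower-case becomes upper-case
shouted=$(printf '%s\n' "$sample" | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$uppers" "$puncts" "$shouted"
```

The first command should pick out the three capital letters, the second the three punctuation marks, and tr should return the whole line in capitals.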

Repetition

Take a look at the four-digit pattern again:

    [[:digit:]][[:digit:]][[:digit:]][[:digit:]]

Let's refer to each instance of [[:digit:]] as being a token, i.e. the token, [[:digit:]], is repeated 4 times in the above pattern. As with most things in computing, anytime you see physical repetition, there's usually a shorthand version.

The plus sign

In regular expressions, the plus sign is a metacharacter used to indicate that the previous token should be matched one or more times:

grep -oE '[[:digit:]]+' excerpt.txt

The result is a list of all strings that contain one or more consecutive numerical digits:

2014
240
000
5
6
2014
1990
2014
58
11
2
90
10
13

Try running the previous regular expression without the plus sign to see the difference.
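
If you'd rather see the difference on something smaller than the excerpt, here's a quick sketch using a made-up line (the sample text and variable names are just for illustration):

```shell
# Hypothetical sample line, made up for illustration
sample='Room 42, floor 7'

# Without the plus sign: every digit is its own match
singles=$(echo "$sample" | grep -oE '[[:digit:]]')

# With the plus sign: consecutive digits are grouped into one match
runs=$(echo "$sample" | grep -oE '[[:digit:]]+')

echo "$singles"   # prints 4, 2, and 7 on separate lines
echo "$runs"      # prints 42 and 7 on separate lines
```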

Limited repetition

The plus sign gets us all numerical strings, whether they consist of one digit or 1,000 digits. But what if we wanted to find just the numerical strings that correspond to years, i.e. just the four-digit numbers?

Instead of using the plus sign after the token that we want to match, we use curly braces to specify the number of repetitions:

grep -oE '[[:digit:]]{4}' excerpt.txt
2014
2014
1990
2014

There are two other variations of the curly brace notation:

Match m to n repetitions:

To find all numerical strings that have at least 2 digits, but no more than 3:

grep -oE '[[:digit:]]{2,3}' excerpt.txt
201
240
000
201
199
201
58
11
90
10
13

Match at least m repetitions:

In this variation, just leave off the maximum bound to specify that you want at least a minimum number of repetitions. The following example finds all strings with 3 or more digits:

grep -oE '[[:digit:]]{3,}' excerpt.txt
2014
240
000
2014
1990
2014
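
To compare the three forms side by side, here's a small sketch on a made-up line of numbers (the sample and variable names are hypothetical). Notice that {4} and {2,3} happily match just part of a longer number:

```shell
# Hypothetical sample line, made up for illustration
sample='7 42 365 2014 90210'

# Exactly 4, between 2 and 3, and 3 or more digits
exactly_four=$(echo "$sample" | grep -oE '[[:digit:]]{4}')
two_to_three=$(echo "$sample" | grep -oE '[[:digit:]]{2,3}')
three_or_more=$(echo "$sample" | grep -oE '[[:digit:]]{3,}')

echo "$exactly_four"    # 2014, then 9021 (the first four digits of 90210)
echo "$two_to_three"    # 42, 365, 201, 902, 10
echo "$three_or_more"   # 365, 2014, 90210
```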

Word boundaries

If we want to match an exact number of repetitions, and nothing longer, we need a more specific regex. For example, the following fails to capture only the 2-digit numbers, because {2} also matches inside longer strings of digits:

echo '42 100 8899' | grep -oE '[[:digit:]]{2}'
42
10
88
99

To solve this particular problem, we use word boundaries, which are denoted by \b ("backslash b"). A word boundary is the position between a word character (a letter, digit, or underscore) and anything else, such as punctuation, a space, or the beginning or end of a line. The following regex matches only 2-digit strings that are surrounded by word boundaries:

echo '42 100 8899' | grep -oE '\b[[:digit:]]{2}\b'
42

Quick tip: The backslash character, \, is an important character among all the arcane characters of Unix (and many other contemporary programming languages). Often, when a backslash directly precedes any other character, such as \b, its effect is to escape that character. In this example, \b does not mean a literal "backslash, then the letter 'b'", but something entirely different: a word boundary. In general, when you see a backslash, and then another character, remember that the combination might have a special meaning.

Word boundaries can be used to find things such as all words that begin with the letter 'w' (note that I'm using grep's -i option to do a case-insensitive search):

grep -oiE '\bw[[:alpha:]]+' excerpt.txt
would
week
we
was
world
We
war

To match all words that end with w:

grep -oE '[[:alpha:]]+w\b' excerpt.txt
new
new
now

And all words that begin and end with the letter a:

grep -oiE '\ba[[:alpha:]]+a\b' excerpt.txt
America
America
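
Word boundaries also combine nicely with the curly braces. Here's a minimal sketch, assuming a grep that supports \b (GNU grep does; \b is an extension rather than part of the POSIX standard), that keeps only standalone four-digit strings. The sample line and variable names are made up:

```shell
# Hypothetical sample line, made up for illustration
sample='the 1990s, 2014, and zip code 90210'

# {4} alone also matches the first four digits of longer tokens
loose=$(echo "$sample" | grep -oE '[[:digit:]]{4}')

# \b on both sides keeps only standalone four-digit strings
years=$(echo "$sample" | grep -oE '\b[[:digit:]]{4}\b')

echo "$loose"   # 1990, 2014, 9021
echo "$years"   # 2014
```

The 1990 in "1990s" is dropped by the second pattern because the "s" that follows it is a word character, so there's no boundary after the last digit.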

Character ranges and specific groups

This is similar to character classes, but a little more fine-grained. For example, to find all words that begin with letters w through z:

grep -oiE '\b[w-z][[:alpha:]]+' excerpt.txt
year
would
year
week
we
was
year
world
year
year
We
years
war

Or you can specify a group of arbitrary characters: here's how to find all words that end in y, g, or k (notice that the order of the desired letters doesn't matter):

grep -oiE '[[:alpha:]]+[ygk]\b' excerpt.txt
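
As a small sketch of the same pattern on a made-up sentence (the sample text and variable name are hypothetical, and \b assumes a grep that supports it, such as GNU grep):

```shell
# Hypothetical sample line, made up for illustration
sample='They keep walking by the park'

# One or more letters, then a final y, g, or k right before a word boundary
endings=$(echo "$sample" | grep -oiE '[[:alpha:]]+[ygk]\b')

echo "$endings"   # They, walking, by, park
```

Note that "keep" is skipped even though it contains a k, because the k isn't the last letter of the word.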

A very common pattern is to match all numerical strings, including ones that contain a comma separator, e.g. "1,999,999", and/or have a decimal point:

grep -oE '[[:digit:]][0-9,.]+' excerpt.txt

The output isn't exactly what we want – notice how it captures 2014, – but close enough for now:

2014
240,000
5.6
2014
1990
2014,
58
11.2
90
10
13
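
If the trailing comma bothers you, one possible refinement is to only allow a comma or dot when it's followed by more digits. This sketch relies on parentheses for grouping and the star for "zero or more", neither of which is covered in this guide, so treat it as an assumption-laden preview rather than the recommended approach. The sample line and variable names are made up:

```shell
# Hypothetical sample line, made up for illustration
sample='In 2014, we created 240,000 jobs; the rate fell to 5.6%.'

# The pattern from above: a digit, then one or more digits, commas, or dots
loose=$(echo "$sample" | grep -oE '[[:digit:]][0-9,.]+')

# Stricter: digits, then zero or more groups of (comma or dot, then digits)
strict=$(echo "$sample" | grep -oE '[[:digit:]]+([,.][[:digit:]]+)*')

echo "$loose"    # 2014, (with the comma), 240,000, 5.6
echo "$strict"   # 2014, 240,000, 5.6
```

One difference to keep in mind: unlike the looser pattern, the stricter one also matches single-digit numbers.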

Negated character sets

Sometimes, we want to match any character – with a few exceptions. The syntax for this is similar to character ranges, except we put a caret symbol right after the opening bracket to indicate that we want any character except the ones enclosed in the brackets.

To find all strings in which th is preceded by at least one letter and followed by a character that is not a vowel (a space counts):

grep -oiE '[a-z]+th[^aeiou]' excerpt.txt 
breakthr
growth 
month 
growth 
health 
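
As another sketch of a negated set, here's a made-up line containing a few HTML-style tags. The pattern reads as: an opening angle bracket, then one or more characters that are not a closing angle bracket, then the closing angle bracket. The sample text and variable name are hypothetical:

```shell
# Hypothetical sample line, made up for illustration
sample='<b>bold</b> and <i>italic</i>'

# [^>]+ matches everything up to, but not including, the next >
tags=$(echo "$sample" | grep -oE '<[^>]+>')

echo "$tags"   # <b>, </b>, <i>, </i> on separate lines
```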

This technique can be used to find all characters within quotation marks (as long as the quotes are contained on the same line). The following example can be read as: match a quotation mark, then every character that is not a quotation mark, until you find another quotation mark:

echo 'You say "Hello?" but I say "Goodbye!" -- Beatles' | grep -oE '"[^"]+"'
"Hello?"
"Goodbye!"

Match any character, period

In the earlier example that matched numbers with [0-9,.], the dot character (i.e., the period, or decimal point) represented a literal dot, but only because it was inside the square brackets used to denote a range or character class.

Normally, the dot character is a very powerful metacharacter: it stands in for any character.

For example, to find all four-character sequences that begin with he and are followed by any two characters:

grep -oiE 'he..' excerpt.txt 
he u
he s
he 1
he '
he w
help
he p
he A
heal
he p

As mentioned previously, the backslash character is used to "escape" a character's usual meaning. So the literal b, when preceded with a backslash, will match a word boundary. However, what about characters, such as the dot, that have a special meaning without a backslash?

In those cases, the backslash escapes them out of their special status. Thus, \. will match a literal dot. To find the last word of every sentence that ends with a dot:

grep -oiE "[[:alpha:]]+\." excerpt.txt
everybody.
America.
up.
jobs.
s.
decades.
jobs.
s.
year.
alone.
thirds.
home.
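
To see the difference between the unescaped and escaped dot, here's a minimal sketch on three made-up lines (the input and variable names are hypothetical):

```shell
# Unescaped dot: any character may sit between the 5 and the 6
any_char=$(printf '5.6\n526\n5-6\n' | grep -E '5.6')

# Escaped dot: only a literal dot may sit between the 5 and the 6
literal_dot=$(printf '5.6\n526\n5-6\n' | grep -E '5\.6')

echo "$any_char"      # all three lines: 5.6, 526, 5-6
echo "$literal_dot"   # only 5.6
```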

Optional match

The question mark is used to indicate that the preceding token is optional. The following regex treats the letters n and s as optional:

grep -oE "American?s?" excerpt.txt
America
American
America
Americans
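
The same trick is handy for spelling variants. Here's a small sketch on a made-up line (the sample and variable name are hypothetical) where the u is optional:

```shell
# Hypothetical sample line, made up for illustration
sample='color colour colouur'

# u? means zero or one u, so the double-u misspelling is not matched
spellings=$(echo "$sample" | grep -oE 'colou?r')

echo "$spellings"   # color, colour
```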

Beginning of a line

The caret character, when used outside of square brackets, denotes the beginning of a line. To find all lines that begin with a capital I:

grep -E '^I' excerpt.txt 
In December, our businesses created 240,000 new jobs.
In 2014, unemployment fell faster than it has in three decades.

End of a line

The dollar sign denotes the end of a line. To find all lines that end with a y and a literal .:

grep -E 'y\.$' excerpt.txt 
Hi, everybody.
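
Using both anchors at once describes the entire line. As a sketch on some made-up input (the lines and variable name are hypothetical), this finds lines that consist of nothing but digits:

```shell
# ^ and $ together: the digits must span the whole line
digit_lines=$(printf '2014\nyear 2014\n1990s\n58\n' | grep -E '^[[:digit:]]+$')

echo "$digit_lines"   # 2014 and 58
```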

Limitations of command-line regular expressions

In this tutorial, we've only covered a small set of regular expression functionality. There are enough features within standard Unix-like operating systems to do some incredibly powerful pattern matching.

However, the standard Unix set of regular expressions doesn't have all of the fun features available in modern languages such as Python and Ruby. The biggest limitation is that Unix regular expression tools such as grep do not (typically) match patterns that are split across multiple lines.
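
Here's a minimal sketch of that limitation, using grep's -c option to count matching lines. The input text and variable names are made up. The dot matches the space when both words are on one line, but it never gets to see the newline when they're on separate lines:

```shell
# Both words on one line: the dot matches the space between them
same_line=$(printf 'job growth\n' | grep -cE 'job.growth')

# Words on separate lines: grep checks each line on its own, so no match.
# (grep exits with a non-zero status when nothing matches, hence "|| true".)
split_lines=$(printf 'job\ngrowth\n' | grep -cE 'job.growth' || true)

echo "$same_line"     # 1
echo "$split_lines"   # 0
```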