Advanced Shell Topics: Wildcards and Regular Expressions


The first time I heard someone talk about regular expressions I think I said something like "regular what?!?". I thought that they were talking about expressions like in programs or something. I found out that regular expressions can be thought of as very advanced wildcards. Basically you can think of them as pattern matching, which is actually an alternate name for them.

NOTE: Filename pattern matching in the shell is more limited than the regular expressions interpreted by programs like grep, awk and sed. This is mainly because characters like | (pipe), &, (), > and < have special meanings to the shell. When you pass characters like these in arguments to a program, you need to quote them, either with a backslash (\) or by wrapping the whole pattern in quotes. Also, Perl regular expressions are a beast of their own. Perl uses basically the same syntax as POSIX regular expressions, but also has several extensions that will only work within Perl code.
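
To make the quoting point concrete, here is a small sketch. The sample text and the variable names are made up for illustration and are not part of the README used later. It assumes a POSIX shell and a standard grep.

```shell
# Hypothetical sample input, not taken from the README.
sample='cat notes | sort
cat notes > out
plain line'

# Single quotes hand the pattern to grep untouched, so the shell never
# treats the | as a pipe.
pipe_lines=$(printf '%s\n' "$sample" | grep ' | ')

# A backslash quotes a single character: grep receives just ">".
redirect_count=$(printf '%s\n' "$sample" | grep -c \>)

printf '%s\n' "$pipe_lines"
printf '%s\n' "$redirect_count"
```

Without the quotes or the backslash, the shell would try to set up a pipe or a redirection before grep ever saw the pattern.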

We'll use grep to demonstrate regular expressions instead of straight pattern matching in the shell. I've borrowed the README file from Bruce Perens' Electric Fence program to demonstrate how different regular expressions will match lines in the file with grep.

[suso@antonio adv_shell]$ cat electricfence-readme 
This is Electric Fence 2.1

Electric Fence is a different kind of malloc() debugger. It uses the virtual
memory hardware of your system to detect when software overruns the boundaries
of a malloc() buffer. It will also detect any accesses of memory that has
been released by free(). Because it uses the VM hardware for detection,
Electric Fence stops your program on the first instruction that causes
a bounds violation. It's then trivial to use a debugger to display the
offending statement.

This version will run on:
        Linux kernel version 1.1.83 and above. Earlier kernels have problems
        with the memory protection implementation.

        All System V Revision 4 platforms (and possibly earlier revisions)
        including:
                Every 386 System V I've heard of.
                Solaris 2.x
                SGI IRIX 5.0 (but not 4.x)

        IBM AIX on the RS/6000.

        SunOS 4.X (using an ANSI C compiler and probably static linking).

        HP/UX 9.01, and possibly earlier versions.

        OSF 1.3 (and possibly earlier versions) on a DECalpha.

On some of these platforms, you'll have to uncomment lines in the Makefile
that apply to your particular system.

If you test Electric Fence on a platform not mentioned here, please send me a
report.

It will probably port to any ANSI/POSIX system that provides mmap(), and
mprotect(), as long as mprotect() has the capability to turn off all access
to a memory page, and mmap() can use /dev/zero or the MAP_ANONYMOUS flag
to create virtual memory pages.

Complete information on the use of Electric Fence is in the manual page
libefence.3 .

        Thanks

        Bruce Perens
        Bruce@Pixar.com
[suso@antonio adv_shell]$

Kinda lengthy but it will do. Let's get started with a simple match to show you how grep works. If I just search for "Bruce" it returns two lines of text which match.

[suso@antonio adv_shell]$ grep "Bruce" electricfence-readme
        Bruce Perens
        Bruce@Pixar.com
[suso@antonio adv_shell]$

First let's talk about character sets. The [] (bracket) characters can be used in regular expressions to mean that you would like to match any one of the characters within the brackets. Let's say we want to match any line with the characters 1, 6 or B on it:

[suso@antonio adv_shell]$ grep "[16B]" electricfence-readme 
This is Electric Fence 2.1
been released by free(). Because it uses the VM hardware for detection,
        Linux kernel version 1.1.83 and above. Earlier kernels have problems
                Every 386 System V I've heard of.
        IBM AIX on the RS/6000.
        HP/UX 9.01, and possibly earlier versions.
        OSF 1.3 (and possibly earlier versions) on a DECalpha.
        Bruce Perens
        Bruce@Pixar.com
[suso@antonio adv_shell]$
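
If you don't have the README handy, the same idea can be tried on a few of its lines fed in on standard input. This is only a sketch; the variable names are my own invention.

```shell
# Three lines copied from the README shown earlier.
lines='This is Electric Fence 2.1
of a malloc() buffer. It will also detect any accesses of memory that has
        Bruce Perens'

# [16B] matches one character that is 1, 6 or B, anywhere on the line.
# Case matters: the lowercase b in "buffer" is not in the set.
set_matches=$(printf '%s\n' "$lines" | grep '[16B]')
set_count=$(printf '%s\n' "$lines" | grep -c '[16B]')

printf '%s\n' "$set_matches"
printf '%s\n' "$set_count"
```

The first and third lines match; the middle one has no 1, 6 or uppercase B, so grep skips it.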

You can also specify a range of characters to be matched in a character set by using a dash. If we wanted all lines that contain any digit from 5 to 9, then we would use [5-9] as the character set pattern.

[suso@antonio adv_shell]$ grep "[5-9]" electricfence-readme 
        Linux kernel version 1.1.83 and above. Earlier kernels have problems
                Every 386 System V I've heard of.
                SGI IRIX 5.0 (but not 4.x)
        IBM AIX on the RS/6000.
        HP/UX 9.01, and possibly earlier versions.
[suso@antonio adv_shell]$
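
A range is just shorthand for listing every character in it. The sketch below, with variable names I made up, checks that [5-9] and [56789] pick out the same lines.

```shell
# Three lines taken from the platform list in the README.
versions='Solaris 2.x
SGI IRIX 5.0 (but not 4.x)
HP/UX 9.01, and possibly earlier versions.'

# The range form and the spelled-out form are equivalent for digits.
range_matches=$(printf '%s\n' "$versions" | grep '[5-9]')
listed_matches=$(printf '%s\n' "$versions" | grep '[56789]')

printf '%s\n' "$range_matches"
```

Only the IRIX and HP/UX lines come back, since "Solaris 2.x" has no digit between 5 and 9.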

If you actually want to match the dash character itself inside of a character set pattern, you have to put it where it can't be read as a range, as either the first or the last character in the group, like this: [123abc-]
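
Here is a quick way to see the difference between a literal dash and a range. The input lines and variable names are assumptions chosen for the demonstration.

```shell
# Hypothetical input: one line with a dash, one with a digit, one with neither.
words='read-only
Solaris 2.x
mmap'

# Dash in last or first position: matches 1, 2, 3 or a literal "-".
dash_last=$(printf '%s\n' "$words" | grep -c '[123-]')
dash_first=$(printf '%s\n' "$words" | grep -c '[-123]')

# Dash between two characters: a range, so only 1, 2 or 3 match.
as_range=$(printf '%s\n' "$words" | grep -c '[1-3]')

printf '%s %s %s\n' "$dash_last" "$dash_first" "$as_range"
```

With the dash at either end, "read-only" matches too; as a range it does not.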

Next let's talk about quantifiers. A quantifier controls how many times the character or character set preceding it can occur. One that looks familiar from shell wildcards, the * (asterisk) character, matches zero or more occurrences of the preceding character or set. (This differs from the shell, where * by itself matches any string.) This usually isn't too helpful in matching alone, but is used more often when doing match and replace functions. The + (plus) character is another quantifier that can be used after a character or character set to indicate that one or more occurrences should be matched. With plain grep, + only acts as a quantifier when written as \+ (a GNU extension) or when you run grep -E.
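
The sketch below shows why * on its own rarely narrows anything down, and how + behaves by comparison. It uses grep -E for the plus sign so that it doesn't depend on GNU-only syntax; the variable names are mine.

```shell
# Three lines taken from the platform list in the README.
lines='IBM AIX on the RS/6000.
SGI IRIX 5.0 (but not 4.x)
Solaris 2.x'

# "0*" also matches zero 0s, so every line counts as a match.
star_count=$(printf '%s\n' "$lines" | grep -c '0*')

# "0+" needs at least one 0 (extended syntax, hence -E).
plus_count=$(printf '%s\n' "$lines" | grep -Ec '0+')

# The same thing in basic syntax: one 0 followed by zero or more 0s.
basic_count=$(printf '%s\n' "$lines" | grep -c '00*')

printf '%s %s %s\n' "$star_count" "$plus_count" "$basic_count"
```

All three lines satisfy 0*, but only the two lines that actually contain a 0 satisfy 0+ or 00*.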

A final way to express quantity of characters is to use curly braces. If you would like to match exactly 3 of a character, you would use {3} after the character. So if you wanted to find an instance of three 0s in a row, you would do this:

[suso@antonio adv_shell]$ grep "0\{3\}" electricfence-readme 
        IBM AIX on the RS/6000.
[suso@antonio adv_shell]$

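If you'd like to match at least 1 of a character but no more than 3, you can use {1,3} as the quantifier.

The curly braces are backslash escaped in the example above because grep uses basic regular expressions by default, where a plain { is an ordinary character and \{ starts a quantifier. The double quotes already keep the shell from interpreting the braces. With grep -E (extended regular expressions) you write the braces without backslashes.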
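
To see the lower and upper bounds at work, the following sketch pins the zeros between the 6 and the period of "6000." from the README. The [.] is a character set holding only a period, and the variable names are assumptions of mine.

```shell
# One line from the README, fed to grep on standard input.
line='        IBM AIX on the RS/6000.'

# Exactly three 0s: basic syntax with \{ \}, extended syntax with -E.
exact_basic=$(printf '%s\n' "$line" | grep -c '0\{3\}')
exact_extended=$(printf '%s\n' "$line" | grep -Ec '0{3}')

# A 6, then one to three 0s, then a period: "6000." fits.
upto3=$(printf '%s\n' "$line" | grep -c '60\{1,3\}[.]')

# A 6, then one or two 0s, then a period: three 0s is one too many.
# grep exits non-zero when nothing matches, hence the "|| true".
upto2=$(printf '%s\n' "$line" | grep -c '60\{1,2\}[.]' || true)

printf '%s %s %s %s\n' "$exact_basic" "$exact_extended" "$upto3" "$upto2"
```

The last count is 0 because nothing in the line has only one or two zeros sitting between a 6 and a period.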

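The next special character we will use for our regular expression is the pipe character, which means logical OR. It matches either what is before the | character OR what is after it. In grep's default basic regular expressions it is written as \| (a GNU extension); with grep -E it is written as a plain |.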

[suso@antonio adv_shell]$ grep "This\|that" electricfence-readme 
This is Electric Fence 2.1
of a malloc() buffer. It will also detect any accesses of memory that has
Electric Fence stops your program on the first instruction that causes
This version will run on:
that apply to your particular system.
It will probably port to any ANSI/POSIX system that provides mmap(), and
[suso@antonio adv_shell]$
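
If your grep doesn't understand \|, two other spellings should give the same result. This is a sketch with variable names of my own; it relies on the standard -E and -e options.

```shell
# Three lines taken from the README.
lines='This version will run on:
that apply to your particular system.
report.'

# Extended syntax: a plain | means OR.
with_pipe=$(printf '%s\n' "$lines" | grep -E 'This|that')

# Several -e options: a line matches if any one of the patterns matches.
with_e=$(printf '%s\n' "$lines" | grep -e 'This' -e 'that')

or_count=$(printf '%s\n' "$lines" | grep -Ec 'This|that')

printf '%s\n' "$with_pipe"
```

Both commands return the first two lines and leave out "report.".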

© 2000 Suso Banderas - suso@suso.org