Advanced Shell Topics: awk and sed

awk

The man page for awk describes it as a "pattern scanning and processing language". There are indeed many things that you can do with awk, but we are only going to cover one of them: its ability to split text into fields, much like the columns of a database.

awk syntax works like this: you give the command awk, then any options you want to use with it, followed by the commands that you want to run on the input. Those commands are enclosed in curly braces and quoted so that the shell passes them to awk untouched. Like this:

	$ awk -F: '{print $1 " " $2}'
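
As a small sketch of what the -F option does, the following feeds awk two colon-separated lines and prints the first and third fields of each. The sample records are made up for illustration, in the style of /etc/passwd entries, and the variable names are arbitrary.

```shell
# Sample colon-separated records (hypothetical data, /etc/passwd style)
records='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -F: tells awk to split each line on ":" instead of on whitespace
fields=$(printf '%s\n' "$records" | awk -F: '{print $1 " " $3}')
printf '%s\n' "$fields"
```

Without -F:, each of these lines would count as a single field, because the records contain no whitespace.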

Let's say that we have an Apache log file and want to print only the first column of each entry, which is the remote host address for the request. This is the first column when using Common Logfile Format.

[user@host ~]$ tail access_log
colosus.iucc.ac.il - - [13/Dec/2000:00:56:19 +0000] "GET /news2html/ HTTP/1.0" 404 635 "-" "Mozilla/3.01 (X11; I; SunOS 4.1.4 sun4m)"
adsl-151-197-17-34.phila.adsl.bellatlantic.net - - [13/Dec/2000:01:34:51 +0000] "GET / HTTP/1.1" 404 2572 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
k0.fujitsu.co.jp - - [13/Dec/2000:01:50:35 +0000] "GET /news2html/groups/alt.ascii-art/2000011023390100.phtml HTTP/1.0" 200 2616 "http://www.google.com/search?q=ISO-88591-1" "Mozilla/4.73 [ja] (WinNT; U)"
cache-db02.proxy.aol.com - - [13/Dec/2000:01:51:23 +0000] "GET /news2html/groups/alt.ascii-art/ HTTP/1.0" 200 495150 "http://google.yahoo.com/bin/query?p=how+to+make+a+home+made+cable+scramblers&hc=0&hs=0" "Mozilla/4.0"
1cust125.tnt10.phoenix.az.da.uu.net - - [13/Dec/2000:01:53:25 +0000] "GET / HTTP/1.1" 200 2572 "-" "Mozilla/4.0 (compatible; MSIE 5.5; MSNIA; Windows 98; thenewweb.com)"
1cust125.tnt10.phoenix.az.da.uu.net - - [13/Dec/2000:01:55:00 +0000] "GET /news2html/groups/alt.ascii-art/ HTTP/1.1" 200 81189 "-" "Mozilla/4.0 (compatible; MSIE 5.5; MSNIA; Windows 98; thenewweb.com)"
cage.suso.org - - [13/Dec/2000:01:55:21 +0000] "GET /presentations/adv_shell/ HTTP/1.1" 200 1822 "-" "Mozilla/5.0 (X11; U; Linux 2.2.18 i586; en-US; m18)"
cage.suso.org - - [13/Dec/2000:01:55:31 +0000] "GET /presentations/adv_shell/awksed.phtml HTTP/1.1" 200 2216 "http://suso.suso.org/presentations/adv_shell/" "Mozilla/5.0 (X11; U; Linux 2.2.18 i586; en-US; m18)"
ai-209-247-40-220.alexa.com - - [13/Dec/2000:02:00:29 +0000] "GET //robots.txt HTTP/1.0" 404 551 "-" "ia_archiver"
ai-209-247-40-220.alexa.com - - [13/Dec/2000:02:00:30 +0000] "GET /news2html/groups/alt.ascii-art HTTP/1.0" 301 326 "-" "ia_archiver"
[user@host ~]$ tail access_log | awk '{print $1}'
colosus.iucc.ac.il
adsl-151-197-17-34.phila.adsl.bellatlantic.net
k0.fujitsu.co.jp
cache-db02.proxy.aol.com
1cust125.tnt10.phoenix.az.da.uu.net
1cust125.tnt10.phoenix.az.da.uu.net
cage.suso.org
cage.suso.org
ai-209-247-40-220.alexa.com
ai-209-247-40-220.alexa.com
[user@host ~]$

Note that the $1 used in the curly braces is not the $1 that bash uses to signify the first argument passed to a script. This is an important difference to realize if you ever start using awk within scripts that take arguments.
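
To make the difference concrete, here is a minimal sketch. The function name print_field and the awk variable name n are invented for this illustration. Outside the awk program, "$1" is the shell's first argument; inside the single-quoted program the shell expands nothing, so any $ there is read by awk. The -v option is one way to hand a shell value to awk.

```shell
# Hypothetical helper: print field number N of every line read from stdin.
print_field() {
    # "$1" is expanded by the shell: it is the function's first argument.
    # Inside the single quotes the shell leaves everything alone, so $n is
    # awk syntax for "the field whose number is stored in the variable n".
    awk -v n="$1" '{print $n}'
}

# Made-up input line; field 4 is the status code
status=$(printf '%s\n' 'cage.suso.org - - 200' | print_field 4)
printf '%s\n' "$status"
```

If the awk program were written in double quotes instead, the shell would substitute its own $1 before awk ever saw the program, which is rarely what you want.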

By default awk splits each line into fields on whitespace (runs of spaces or tabs), so if we add more elements to the print statement, awk will print the corresponding fields:

[user@host ~]$ tail access_log | awk '{print $1 " " $9 " " $10}'
colosus.iucc.ac.il 404 635
adsl-151-197-17-34.phila.adsl.bellatlantic.net 404 2572
k0.fujitsu.co.jp 200 2616
cache-db02.proxy.aol.com 200 495150
1cust125.tnt10.phoenix.az.da.uu.net 200 2572
1cust125.tnt10.phoenix.az.da.uu.net 200 81189
cage.suso.org 200 1822
cage.suso.org 200 2216
ai-209-247-40-220.alexa.com 404 551
ai-209-247-40-220.alexa.com 301 326
[user@host ~]$
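
The default splitting can be sketched on a couple of made-up lines with uneven spacing (the host names and numbers below are invented). The sketch also uses a comma between the items in print, which makes awk join them with its output field separator, a single space by default, so the " " strings are not needed.

```shell
# Two made-up lines: fields are separated by several spaces or by a tab
sample=$(printf 'host-a   404\t635\nhost-b 200    1822\n')

# Runs of spaces and tabs count as one separator, so each line has three fields.
# The commas in print put a single space between the printed fields.
cleaned=$(printf '%s\n' "$sample" | awk '{print $1, $2, $3}')
printf '%s\n' "$cleaned"
```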

sed

sed is a great program for substituting text. Like awk, it does a lot more than just one function, but you'll probably end up using it a lot for making substitutions in text.

[user@host ~]$ cat names
John Daggett, 341 King Road, Plymouth MA
Alice Ford, 22 East Broadway, Richmond VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa OK
Terry Kalkas, 402 Lans Road, Beaver Falls PA
Eric Adams, 20 Post Road, Sudbury MA
Hubert Sims, 328A Brook Road, Roanoke VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View CA
Sal Carpenter, 73 6th Street, Boston MA

The 's' (substitute) command replaces one string of characters with another. It takes two arguments delimited by '/' characters, in the form s/search/replacement/: what you are searching for goes between the first and second '/', and the string that you want to replace it with goes between the second and third. In this example, we replace the two-letter state abbreviations MA and CA with their full names:

[user@host ~]$ cat names | sed 's/MA/Massachusetts/; s/CA/California/'
John Daggett, 341 King Road, Plymouth Massachusetts
Alice Ford, 22 East Broadway, Richmond VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa OK
Terry Kalkas, 402 Lans Road, Beaver Falls PA
Eric Adams, 20 Post Road, Sudbury Massachusetts
Hubert Sims, 328A Brook Road, Roanoke VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View California
Sal Carpenter, 73 6th Street, Boston Massachusetts

The previous example also shows how you can do multiple substitutions by using ; to separate them. sed can also be useful if you want to fix syntax problems that run through all of your data. Notice that in the names file the city doesn't have a comma after it. Let's fix this:

[user@host ~]$ cat names | sed 's/ \([A-Z]\{2\}\)$/, \1/'
John Daggett, 341 King Road, Plymouth, MA
Alice Ford, 22 East Broadway, Richmond, VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa, OK
Terry Kalkas, 402 Lans Road, Beaver Falls, PA
Eric Adams, 20 Post Road, Sudbury, MA
Hubert Sims, 328A Brook Road, Roanoke, VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View, CA
Sal Carpenter, 73 6th Street, Boston, MA

The expression contains a few concepts that have been covered before, but I will go over them again. The first noticeable one is the use of the \( \) sequence. This will "remember" whatever is placed inside so that you can use it in the replacement expression. The remembered groups are numbered from left to right, so depending on how many \( \) pairs you used, you refer to them as \1, \2, \3 and so on in the replacement expression.
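
As a sketch of using more than one \( \) group, the following takes one line in the format of the names file and swaps the first and last name. The variable names are arbitrary, and the pattern assumes that both names consist only of letters.

```shell
line='John Daggett, 341 King Road, Plymouth MA'

# Group 1 remembers the first name, group 2 the last name.
# The replacement writes them back in the opposite order: \2 then \1.
swapped=$(printf '%s\n' "$line" | sed 's/^\([A-Za-z]*\) \([A-Za-z]*\),/\2 \1,/')
printf '%s\n' "$swapped"
```

For this line the result begins with "Daggett John," and the rest of the address is unchanged.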

The second thing is the character set [A-Z]. The [] expression means match any single character in that set. The A-Z part means that the set includes all characters from A through Z. Note, this is case sensitive and a-z will not match the same set of characters that A-Z will. If you just put [AZ] in, it will only match A or Z, not B, C, D, E, ... , W, X or Y.
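
The difference between [A-Z] and [AZ] can be sketched on a short made-up string. Two things here go beyond the text above and are assumptions of this sketch: the g flag at the end of the s command, which makes sed replace every match on the line rather than only the first, and LC_ALL=C, which pins the range to plain ASCII order, since what a range like A-Z covers can depend on the locale.

```shell
# Replace every character that the set matches with "#"
range=$(printf 'ABYZ az\n' | LC_ALL=C sed 's/[A-Z]/#/g')   # set: A through Z
pair=$(printf 'ABYZ az\n' | LC_ALL=C sed 's/[AZ]/#/g')     # set: only A and Z
printf '%s\n%s\n' "$range" "$pair"
```

The first command replaces all four uppercase letters, the second only the A and the Z, and neither touches the lowercase "az".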

The third part of the expression is the \{2\} part. This means that the previous expression must match exactly 2 times in a row, so [A-Z]\{2\} matches two consecutive uppercase letters. The $ character at the end of the matching expression matches the end of the line.

Together this all means, match a space followed by exactly 2 characters that are from A through Z and are at the end of the line. In this file, as long as the format remains the same, it will only match the last field containing the state's abbreviation following the city name. So it is relatively easy to add the comma between those two fields.
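
To check that reading of the expression, here is a small sketch that runs it over a few made-up lines, only the first of which ends in a space followed by exactly two uppercase letters. The function name fix_state is invented for this illustration, and LC_ALL=C is an added assumption that keeps the A-Z range in plain ASCII order.

```shell
# Hypothetical wrapper around the expression discussed above
fix_state() {
    LC_ALL=C sed 's/ \([A-Z]\{2\}\)$/, \1/'
}

# Only 'Sudbury MA' should change:
#   'Sudbury Mass' ends in lowercase letters,
#   'MA Sudbury'   has MA at the start of the line, not the end,
#   'Sudbury USA'  has three uppercase letters after the space.
fixed=$(printf '%s\n' 'Sudbury MA' 'Sudbury Mass' 'MA Sudbury' 'Sudbury USA' | fix_state)
printf '%s\n' "$fixed"
```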