The man page for awk describes it as a "pattern scanning and processing language". There are many things that you can do with awk, but we are going to cover only one of them: its ability to split text into fields, like the columns of a database table.
The awk syntax works like this: you give the command awk, then any options you want to use with it, followed by the program, which is the set of commands that you want to run on the input, enclosed in curly braces and quoted so that the shell passes it to awk untouched. Like this:
$ awk -F: '{print $1 " " $2}'
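As a minimal sketch of that syntax in action, the lines below run a similar command on a couple of made-up, colon-separated records. The data and the variable names (records, result) are assumptions for illustration only.

```shell
# Hypothetical colon-separated records: name, department, extension.
records='alice:sales:4101
bob:support:4102'

# -F: tells awk to split each line on colons instead of whitespace.
# The program prints the first and third fields separated by a space.
result=$(printf '%s\n' "$records" | awk -F: '{print $1 " " $3}')
echo "$result"
```

This should print "alice 4101" and "bob 4102" on separate lines.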
Let's say that we have an Apache log file and want to print only the first column of each entry, which is the remote host address of the request. This is the first column when using Common Log Format.
[user@host ~]$ tail access_log
colosus.iucc.ac.il - - [13/Dec/2000:00:56:19 +0000] "GET /news2html/ HTTP/1.0" 404 635 "-" "Mozilla/3.01 (X11; I; SunOS 4.1.4 sun4m)"
adsl-151-197-17-34.phila.adsl.bellatlantic.net - - [13/Dec/2000:01:34:51 +0000] "GET / HTTP/1.1" 404 2572 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
k0.fujitsu.co.jp - - [13/Dec/2000:01:50:35 +0000] "GET /news2html/groups/alt.ascii-art/2000011023390100.phtml HTTP/1.0" 200 2616 "http://www.google.com/search?q=ISO-88591-1" "Mozilla/4.73 [ja] (WinNT; U)"
cache-db02.proxy.aol.com - - [13/Dec/2000:01:51:23 +0000] "GET /news2html/groups/alt.ascii-art/ HTTP/1.0" 200 495150 "http://google.yahoo.com/bin/query?p=how+to+make+a+home+made+cable+scramblers&hc=0&hs=0" "Mozilla/4.0"
1cust125.tnt10.phoenix.az.da.uu.net - - [13/Dec/2000:01:53:25 +0000] "GET / HTTP/1.1" 200 2572 "-" "Mozilla/4.0 (compatible; MSIE 5.5; MSNIA; Windows 98; thenewweb.com)"
1cust125.tnt10.phoenix.az.da.uu.net - - [13/Dec/2000:01:55:00 +0000] "GET /news2html/groups/alt.ascii-art/ HTTP/1.1" 200 81189 "-" "Mozilla/4.0 (compatible; MSIE 5.5; MSNIA; Windows 98; thenewweb.com)"
cage.suso.org - - [13/Dec/2000:01:55:21 +0000] "GET /presentations/adv_shell/ HTTP/1.1" 200 1822 "-" "Mozilla/5.0 (X11; U; Linux 2.2.18 i586; en-US; m18)"
cage.suso.org - - [13/Dec/2000:01:55:31 +0000] "GET /presentations/adv_shell/awksed.phtml HTTP/1.1" 200 2216 "http://suso.suso.org/presentations/adv_shell/" "Mozilla/5.0 (X11; U; Linux 2.2.18 i586; en-US; m18)"
ai-209-247-40-220.alexa.com - - [13/Dec/2000:02:00:29 +0000] "GET //robots.txt HTTP/1.0" 404 551 "-" "ia_archiver"
ai-209-247-40-220.alexa.com - - [13/Dec/2000:02:00:30 +0000] "GET /news2html/groups/alt.ascii-art HTTP/1.0" 301 326 "-" "ia_archiver"
[user@host ~]$ tail access_log | awk '{print $1}'
colosus.iucc.ac.il
adsl-151-197-17-34.phila.adsl.bellatlantic.net
k0.fujitsu.co.jp
cache-db02.proxy.aol.com
1cust125.tnt10.phoenix.az.da.uu.net
1cust125.tnt10.phoenix.az.da.uu.net
cage.suso.org
cage.suso.org
ai-209-247-40-220.alexa.com
ai-209-247-40-220.alexa.com
[user@host ~]$
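Note that the $1 used inside the awk program is not the $1 that bash uses to signify the first argument passed to a script. The single quotes keep the shell from expanding it, so awk receives $1 as written and treats it as the first field of the line. This is an important difference to keep in mind if you ever start using awk within scripts that take arguments.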
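To make the two meanings of $1 concrete, here is a small sketch that uses both in one place. The function name label_hosts and the variable names are hypothetical; the -v option is the standard awk way of handing a shell value to the program as an awk variable.

```shell
# label_hosts is a hypothetical helper written for this sketch.
label_hosts() {
    # "$1" outside the single quotes is expanded by the shell: it is the
    # first argument given to label_hosts. The $1 inside the single quotes
    # is left alone by the shell, so awk reads it as "first field".
    awk -v label="$1" '{print label ": " $1}'
}

labeled=$(printf '%s\n' 'cage.suso.org - -' 'k0.fujitsu.co.jp - -' | label_hosts host)
echo "$labeled"
```

This should print "host: cage.suso.org" and "host: k0.fujitsu.co.jp". Passing the shell value in with -v avoids having to break out of the single quotes in the middle of the awk program.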
By default awk splits each line into fields on whitespace (any run of spaces or tabs), so if we add more elements to the print statement, awk will print the corresponding columns:
[user@host ~]$ tail access_log | awk '{print $1 " " $9 " " $10}'
colosus.iucc.ac.il 404 635
adsl-151-197-17-34.phila.adsl.bellatlantic.net 404 2572
k0.fujitsu.co.jp 200 2616
cache-db02.proxy.aol.com 200 495150
1cust125.tnt10.phoenix.az.da.uu.net 200 2572
1cust125.tnt10.phoenix.az.da.uu.net 200 81189
cage.suso.org 200 1822
cage.suso.org 200 2216
ai-209-247-40-220.alexa.com 404 551
ai-209-247-40-220.alexa.com 301 326
[user@host ~]$
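If it is not obvious why the status code and size end up in fields 9 and 10, the following sketch numbers the first ten fields of a single entry from the log excerpt. The variable names entry and numbered are assumptions, and the for loop is only there to avoid writing ten print statements.

```shell
# One entry taken from the access_log excerpt above.
entry='cage.suso.org - - [13/Dec/2000:01:55:21 +0000] "GET /presentations/adv_shell/ HTTP/1.1" 200 1822 "-" "Mozilla/5.0 (X11; U; Linux 2.2.18 i586; en-US; m18)"'

# Print the first ten fields, one per line, each prefixed with its number.
numbered=$(printf '%s\n' "$entry" | awk '{for (i = 1; i <= 10; i++) print i ": " $i}')
echo "$numbered"
```

awk knows nothing about the brackets or the quotes, so the timestamp and the quoted request are each broken into several fields wherever a space occurs. That is why the status code lands in field 9 and the size in field 10.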
sed is a great program for substituting text. Like awk, it can do much more than one thing, but you'll probably end up using it mostly for making substitutions in text. The examples below work on this file of names and addresses:
[user@host ~]$ cat names
John Daggett, 341 King Road, Plymouth MA
Alice Ford, 22 East Broadway, Richmond VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa OK
Terry Kalkas, 402 Lans Road, Beaver Falls PA
Eric Adams, 20 Post Road, Sudbury MA
Hubert Sims, 328A Brook Road, Roanoke VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View CA
Sal Carpenter, 73 6th Street, Boston MA
By using the 's' substitution command, we can replace one string of characters with another. The 's' command takes two arguments, delimited by '/' characters: the text you are searching for goes between the first and second '/', and the text you want to replace it with goes between the second and third. In this example, we replace the two-letter state abbreviations MA and CA with their full names:
[user@host ~]$ cat names | sed 's/MA/Massachusetts/; s/CA/California/'
John Daggett, 341 King Road, Plymouth Massachusetts
Alice Ford, 22 East Broadway, Richmond VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa OK
Terry Kalkas, 402 Lans Road, Beaver Falls PA
Eric Adams, 20 Post Road, Sudbury Massachusetts
Hubert Sims, 328A Brook Road, Roanoke VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View California
Sal Carpenter, 73 6th Street, Boston Massachusetts
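For a version you can paste without creating the names file first, here is a small sketch that feeds two of the lines to sed directly. It also runs the same pair of substitutions a second way, with one -e option per command, which is a standard sed option. The variable names are assumptions for illustration.

```shell
# Two lines copied from the names file.
addresses='Eric Adams, 20 Post Road, Sudbury MA
Amy Wilde, 334 Bayshore Pkwy, Mountain View CA'

# Two s commands in one script, separated by a semicolon.
with_semicolon=$(printf '%s\n' "$addresses" | sed 's/MA/Massachusetts/; s/CA/California/')

# The same two s commands, each passed with its own -e option.
with_e=$(printf '%s\n' "$addresses" | sed -e 's/MA/Massachusetts/' -e 's/CA/California/')

echo "$with_semicolon"
```

Both forms should give the same result here: the first line ends in "Sudbury Massachusetts" and the second in "Mountain View California".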
The previous example also shows how you can do multiple substitutions by using a ; to separate them. sed can also be useful if you want to fix syntax problems that stretch across all of your data. Notice that in the names file, the city doesn't have a comma after it. Let's fix this:
[user@host ~]$ cat names | sed 's/ \([A-Z]\{2\}\)$/, \1/'
John Daggett, 341 King Road, Plymouth, MA
Alice Ford, 22 East Broadway, Richmond, VA
Orville Thomas, 11345 Oak Bridge Road, Tulsa, OK
Terry Kalkas, 402 Lans Road, Beaver Falls, PA
Eric Adams, 20 Post Road, Sudbury, MA
Hubert Sims, 328A Brook Road, Roanoke, VA
Amy Wilde, 334 Bayshore Pkwy, Mountain View, CA
Sal Carpenter, 73 6th Street, Boston, MA
The expression contains a few concepts that have been covered before, but I will go over them again. The first noticeable one is the use of the \( \) sequence. This "remembers" whatever text is matched inside it so that you can reuse that text in the replacement expression. The \( \) pairs are numbered from left to right, so you refer to them as \1, \2, \3 and so on in the replacement expression, depending on how many you used.
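A short sketch with two remembered groups may make the numbering clearer. It swaps a first and last name taken from the names file; the variable name swapped is an assumption.

```shell
# \([A-Za-z]*\) appears twice, so the first name is saved as \1 and the
# last name as \2. The replacement writes them back in reverse order.
swapped=$(printf '%s\n' 'John Daggett' | sed 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2, \1/')
echo "$swapped"
```

This should print "Daggett, John".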
The second thing is the character set [A-Z]. The [] expression means match any single character in that set. The A-Z part means that the set includes all characters from A through Z. Note that this is case sensitive: a-z will not match the same set of characters that A-Z will. If you write just [AZ], it will match only A or Z, not B, C, D, E, ... , W, X or Y.
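The difference between these sets is easy to check by replacing every matching character with an underscore. This is only a sketch: the sample string and variable names are made up, the trailing g flag makes sed replace every match on the line rather than just the first, and LC_ALL=C is set as a precaution so that the ranges follow plain ASCII order.

```shell
sample='A B Z a z'

# [AZ] matches only the two listed characters.
only_a_or_z=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[AZ]/_/g')
# [A-Z] matches every upper case letter from A through Z.
upper_range=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[A-Z]/_/g')
# [a-z] matches the lower case letters instead.
lower_range=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[a-z]/_/g')

echo "$only_a_or_z"
echo "$upper_range"
echo "$lower_range"
```

The three lines printed should be "_ B _ a z", "_ _ _ a z" and "A B Z _ _".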
The third part of the expression is the \{2\} part. This means that the preceding expression must match exactly 2 times in a row. The $ character at the end of the matching expression matches the end of the line.
Taken together, this means: match a space followed by exactly 2 characters from A through Z that sit at the end of the line. In this file, as long as the format remains the same, that will only match the last field, the state abbreviation following the city name. So it is relatively easy to add the comma between those two fields.
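To see that the expression really is that selective, the sketch below runs it on one line from the names file and on one made-up line that ends in three capital letters. The second line and the variable names are assumptions for illustration, and LC_ALL=C is again only a precaution.

```shell
# The first line comes from the names file; the second is a made-up entry
# that ends in three capital letters instead of two.
lines='Sal Carpenter, 73 6th Street, Boston MA
Pat Example, 1 Main Street, Springfield USA'

# Insert a comma before a two-letter, upper case field at the end of a line.
fixed=$(printf '%s\n' "$lines" | LC_ALL=C sed 's/ \([A-Z]\{2\}\)$/, \1/')
echo "$fixed"
```

The first line should come out ending in "Boston, MA". The second should be unchanged, because a space followed by exactly two capitals and then the end of the line never occurs in it.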