Advanced Shell Topics: Pipefitting


Pipefitting

All the commands that you have learned so far are fine when used on their own, but their real power becomes evident when you use them together. By piping the output from one command into the next, you can do just about anything with your data from the command line. To wrap up this discussion, I'll include some common uses for pipefitting.

The first one here is something that I tend to do a lot. The first column in an Apache web server log file is a record of the remote host name or IP address. I find it useful to know how many unique visitors are going to one of my websites from time to time. So to get that unique count, I use the following pipeline:

[user@host ~]$ cat suso.suso.org-access_log | awk {'print $1'} | sort | uniq | wc -l
   1583
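
If you prefer fewer steps, awk can read the log file directly and sort can drop the duplicates itself with its -u option. This is just a shorter sketch of the same count, assuming the same log file name:

[user@host ~]$ awk {'print $1'} suso.suso.org-access_log | sort -u | wc -l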

Or if you want to find the unique visitors based on the domain name, you can split up the hostname and keep just the last two parts (the domain):

[user@host ~]$ cat suso.suso.org-access_log | awk {'print $1'} | awk -F. {'print $(NF-1) "." $NF'} | sort | uniq | wc -l
    734

The -F. option for awk specifies that you want to use a . as the field separator. You can choose any argument you want for the -F option, including multiple characters. The $NF variable refers to the last field on the line. As you can see, I use $NF as well as $(NF-1), which is just an expression meaning the next-to-last field. This is handy when the number of fields changes from one line to the next, but you know you always need the last or next-to-last field, as is the case with a fully qualified host name.
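
If you want to see what those variables resolve to, you can feed awk a single hostname with echo. The hostname here is just made up for illustration:

[user@host ~]$ echo "host123.dialup.example.net" | awk -F. {'print $(NF-1) "." $NF'}
example.net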

Another useful statistic that I like to see is how many visits per hour of a certain day to a certain page of my site. Let's say I want to find out the number of unique host visits to my /docs/ directory versus the hour of the day:

[user@host ~]$ for i in `seq -w 0 23` ; do echo -n "$i: " ; grep " \[28/Dec/2003:$i:" suso.suso.org-access_log.2003.12 | grep "GET /docs/" | awk {'print $1'} | sort | uniq | wc -l ; done
00:       2
01:       0
02:       3
03:       3
04:       2
05:       3
06:       1
07:       0
08:       1
09:       1
10:       0
11:       0
12:       2
13:       0
14:       2
15:       2
16:       3
17:       3
18:       2
19:       1
20:       1
21:       5
22:       2
23:       3

This uses a for loop and the seq program, which creates a list of numbers for the for loop to go through. In this case, we want the hours of the day from 00 to 23. By using the -w option, we ensure that the numbers from 0 to 9 are padded with a leading zero. The $i variable, which gets replaced with a value from the for loop on every iteration, is placed in the grep expression so that it can find all the lines for each hour of that day. Make sure you include enough context in the matching expression so that you don't match lines you shouldn't. For instance, if you just used ":$i:" as the matching expression, it might match lines for other hours that have something like "GET /somedirectory/index.cgi?value=32515:01:3515" in them.
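
To see the numbers that seq -w hands to the loop, you can run it by itself; the leading zeros are there because the widest number in the range, 23, is two digits wide:

[user@host ~]$ seq -w 0 23 | head -4
00
01
02
03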

If you want an aggregate total for each hour for the whole month, just replace the specific day in the grep expression with '..':

[user@host ~]$ for i in `seq -w 0 23` ; do echo -n "$i: " ; grep " \[../Dec/2003:$i:" suso.suso.org-access_log.2003.12 | awk {'print $1'} | sort | uniq | wc -l ; done
00:      38
01:      40
02:      39
03:      42
04:      35
05:      30
06:      41
07:      32
08:      45
09:      51
10:      46
11:      44
12:      39
13:      54
14:      59
15:      64
16:      81
17:      59
18:      69
19:      53
20:      81
21:      57
22:      61
23:      54
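
Running grep through the whole log 24 times works fine, but it can be slow on a large file. As an alternative sketch, a single awk pass can build the same per-hour counts of unique hosts. This assumes the standard Apache log format, where the timestamp ([28/Dec/2003:14:05:23) is the fourth field; awk splits that field on the colons to get the hour, remembers each host it has already seen in that hour, and prints the totals at the end:

[user@host ~]$ awk '{ split($4, t, ":"); key = t[2] ":" $1; if (!(key in seen)) { seen[key] = 1; count[t[2]]++ } } END { for (h in count) print h ": " count[h] }' suso.suso.org-access_log.2003.12 | sort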

Or you can use a double for loop to count through the hours of each day across several days. Be sure to use different variable names for each of your loops.

[user@host ~]$ for h in `seq 28 31` ; do echo "Dec $h:" ; for i in `seq -w 0 23` ; do echo -n "  $i: " ; grep "\[$h/Dec/2003:$i:" suso.suso.org-access_log.2003.12 | grep "GET /docs/" | awk {'print $1'} | sort | uniq | wc -l ; done ; done
Dec 28:             
  00:       0       
  01:       2       
  02:       1       
  03:       1       
  04:       2       
  05:       2       
  06:       2       
  07:       4       
  08:       3       
  09:       0       
  10:       1       
  11:       0       
  12:       1       
  13:       1       
  14:       0       
  15:       1       
  16:       2       
  17:       1       
  18:       1       
  19:       1       
  20:       0       
  21:       1       
  22:       1       
  23:       3       
Dec 29:             
  00:       1       
  01:       2       
  02:       2       
  03:       3       
  04:       1       
  05:       3       
  06:       3       
  07:       4       
  08:       0       
  09:       0       
  10:       4       
  11:       1       
  12:       1       
  13:       2       
  14:       4       
  15:       7       
  16:       6       
  17:       4       
  18:       1       
  19:       1       
  20:       1       
  21:       5       
  22:       3       
  23:       0       
Dec 30:             
  00:       3       
  01:       2       
  02:       1       
  03:       2       
  04:       1       
  05:       0       
  06:       2       
  07:       0
  08:       0
  09:       1
  10:       1
  11:       1
  12:       2
  13:       2
  14:       0
  15:       2
  16:       2
  17:       2
  18:       2
  19:       1
  20:       2
  21:       0
  22:       4
  23:       3
Dec 31:
  00:       1
  01:       4
  02:       2
  03:       3
  04:       3
  05:       1
  06:       1
  07:       0
  08:       0
  09:       2
  10:       0
  11:       2
  12:       1
  13:       1
  14:       2
  15:       0
  16:       5
  17:       4
  18:       2
  19:       2
  20:       0
  21:       1
  22:       1
  23:       1

Output like this allows you to see trends in your data over a short period of time.
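
If a command line like that gets hard to read, you can type the same thing spread over several lines; the shell waits for the final done before running it. This is just the double loop above reformatted, nothing else is changed:

for h in `seq 28 31` ; do
    echo "Dec $h:"
    for i in `seq -w 0 23` ; do
        echo -n "  $i: "
        grep "\[$h/Dec/2003:$i:" suso.suso.org-access_log.2003.12 | grep "GET /docs/" | awk {'print $1'} | sort | uniq | wc -l
    done
done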

Another frequency-counting example is where you have a list of words, such as the usernames on a Unix system, and you want to find out what the most common first letter is. You can use cut to take the first character of each line, then count each letter with uniq -c and order the results numerically with sort -n:

[user@host ~]$ cat /etc/passwd | cut -c -1 | sort | uniq -c | sort -n
      1 i
      1 l
      1 n
      1 o
      1 p
      1 q
      1 v
      1 z
      2 w
      3 e
      4 d
      4 k
      5 h
      5 r
      5 t
      6 a
      7 b
      9 c
      9 s
     10 j
     10 m

Some people might think to use awk to split the fields of the password file on the colon character, but that's a wasted step, since cut is going to truncate everything but the first character anyway.
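
For comparison, that awk version would look something like this sketch; it produces the same counts, it just takes the extra step of pulling out the username field before grabbing its first character:

[user@host ~]$ awk -F: {'print substr($1, 1, 1)'} /etc/passwd | sort | uniq -c | sort -n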


© 2000 Suso Banderas - suso@suso.org