Last week we took a quick look at shell scripting. This week we’ll finish off with a look at some useful tools – awk, sed, sort and uniq.
These are used mainly for manipulating text and gathering information. Awk and sed each provide a fairly rich language, but can be used for very simple things too. Sort and uniq are simpler utilities that each carry out essentially one operation – sorting, and extracting unique entries (among other things), respectively.
We’ll start simple, with sort and uniq.
Sort & Uniq
Sort is useful for putting items into order – it takes newline-separated text on input, and outputs the same lines sorted according to the parameters given. When used with the ‘-r’ parameter, sort will perform a reverse-order sort; with ‘-n’ it sorts numerically rather than lexically.
This tool is particularly useful when used with uniq, which requires that the input is sorted.
Uniq will remove duplicate entries from the input, display only duplicated entries (when the ‘-d’ parameter is used), or prefix each entry with a count of how many times it occurred (when used with ‘-c’). The latter function is quite useful for collecting data – I used it just the other day on my log file, to find the IP addresses of some spammers to block. See the bottom of this post for more details.
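As a quick illustration (words.txt is just a made-up file with one word per line), this counts how often each word appears, most frequent first:

$ sort words.txt | uniq -c | sort -nr

The first sort groups identical lines together (which uniq requires), uniq -c prefixes each line with its count, and the final sort -nr puts the highest counts at the top.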
Sed
Every Unix user’s friend, sed is great for performing replacements on input, among other things. A typical use is taking some input, applying a regex-based search-and-replace, and printing the result. It’s useful for removing unwanted characters from input, pruning input for further processing, and changing the order of fields (although awk can do this too).
For example, converting all files in the current directory from png to jpg:
$ find . -maxdepth 1 -type f -iname \*.png -print0 | xargs -0 -I{} sh -c 'convert "{}" "`echo {} | sed s/png/jpg/`"'
This grabs all ‘png’ files in the current directory, and runs ‘convert’ with the original filename as the first argument, and the same filename with ‘png’ replaced with ‘jpg’, as the second argument.
So, a common usage of sed is:
$ source | sed s/search/replace/ | destination
This replaces the ‘search’ pattern with ‘replace’ in the input from ‘source’.
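For instance, a trivial example of this pattern, replacing one word with another:

$ echo "hello world" | sed 's/world/sed/'
hello sed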
‘Search’ is a regular expression that specifies the text to have the transformation applied to it. It can include escaped parentheses ‘\(’ and ‘\)’ (in sed’s default basic regex syntax) to designate captured text, which can then be referenced in the replacement through backreferences, denoted by ‘\n’, where n is the index of the captured expression. This is a trivial example, which could probably be done in many ways, but:
$ cat index.html | sed 's/<[bB]>\([^<]*\)<\/[bB]>/<H2>\1<\/H2>/'
This looks for a string of characters containing no ‘<’, between a ‘<b>’ tag (the b can be lower- or uppercase) and a closing ‘</b>’ tag. The whole match is replaced with <H2>, then the captured text that was between the two tags, then </H2>.
Tricky, but after some practice, sed is very useful.
Awk
Awk is a very full-featured text-processing tool, which includes a fairly complete programming language. It allows for variables, arithmetic operations, fairly complicated search patterns, text-manipulation functions, and execution-control statements – if, while, for, and so on.
There are a huge number of awk tutorials out there if you want more information than this very quick intro provides.
So, an awk expression typically looks like:
PATTERN { Action; Action; Action } PATTERN2 { Action; Action }
…That is, one or more pattern-action pairs, where an empty pattern will match everything. Patterns can be ‘BEGIN’, which causes the associated actions to be executed before anything else (for variable initialisation, for example), ‘END’, which executes after everything else (for reporting, perhaps), a regular expression, or some other comparison expression.
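For example, this counts the lines mentioning ‘error’ in a file (the log file name here is just for illustration – any text file will do), using all three kinds of pattern:

$ awk 'BEGIN { n = 0 } /error/ { n++ } END { print n " lines mentioned error" }' /var/log/syslog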
Variables are treated much like C variables (‘var = "some value"; print var;’). There are some special variables defined in awk: NF, which gives the number of fields in the current record (that is, the number of items separated by the field-separator character on the current line of input); NR, which gives the current record number (line number); and FS, the field separator itself, which can be set to any character or, in fact, any regular expression.
Note that the field separator can be set with the ‘-F’ parameter when calling awk.
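For example, to print the line number, the first field and the field count for each line of /etc/passwd (any colon-separated file would do):

$ awk -F: '{ print NR, $1, NF }' /etc/passwd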
Awk also defines some functions fairly similar to C’s – printf in particular, which formats a series of variables according to a given format string.
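For example (counts.txt is just a hypothetical file with a name and a number on each line), this lines the values up in tidy columns:

$ awk '{ printf "%-20s %6d\n", $1, $2 }' counts.txt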
One simple use of awk is to extract (and possibly format) fields from a file. The command:
$ awk '{print $1}' /var/log/httpd/access_log
…Extracts the IP address from an Apache log file. This can then be processed to extract statistics.
A more advanced example counts the number of times a certain IP address was logged:
$ awk 'BEGIN { ip = "127.0.0.1" } $1 == ip {x++} END { print ip " was logged " x " times" }' /var/log/httpd/access_log
127.0.0.1 was logged 7 times
This starts by setting an ‘ip’ variable, the IP to search for. Then, for every record (line) whose first field matches that ip variable, the variable ‘x’ is incremented. At the end, it prints a message containing the IP address and the number of times it was found in the log.
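As an aside, rather than hard-coding the address in a BEGIN block, you can pass it in from the shell with awk’s -v option, which makes the one-liner easier to reuse:

$ awk -v ip=127.0.0.1 '$1 == ip {x++} END { print ip " was logged " x " times" }' /var/log/httpd/access_log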
Putting it together
Here’s an example of using sort, uniq, sed and awk (this may not be the optimal solution, but it works). I was being comment-spammed quite severely (as usual), and wanted to try blocking the offenders. I had a look in my log to grab out the offending IP addresses:
$ grep comment.php public_html/mike/db/hits | tail -n 1000 | awk '{print $3}' | sed s/:// | sort -n | uniq -c | sort -nr | less
This performs the following:
- Searches public_html/mike/db/hits for the term ‘comment.php’ (which the spammers were attacking)
- Grabs the last 1000 lines
- Extracts the third field, which was the IP address
- Removes the colon character, which was in the field
- Performs a numeric sort on the IP addresses
- Counts the number of repetitions of addresses
- Sorts in reverse numeric order, so that the largest number of repetitions (repeat offenses) are displayed first
- Displays the results in the ‘less’ viewer
The output looked something like:
37 200.88.223.98
20 202.83.174.66
13 213.203.234.141
12 62.221.222.5
11 222.236.34.90
10 192.104.37.15
10 165.229.205.91
 9 213.148.158.191
 9 193.227.17.30
 8 61.238.244.86
...
I then blocked the first few IP addresses (which, as a side note, didn’t help reduce spamming – there appears to be an almost infinite pool of addresses which are used).
So, that’s the rundown on a few useful tools. They can save a lot of time, especially when combined with a bit of shell script framework. Try them out sometime!