About This
I am Michael Tyson, and I run A Tasty Pixel. I write on a variety of technology and software development topics as I travel around Europe in a motorhome.



Unix tools tutorial
Last week we took a quick look at shell scripting. This week we’ll finish off with a look at some useful tools – awk, sed, sort and uniq.
These are used mainly for manipulating text and gathering information. Awk and sed each provide a fairly rich language, but can be used for very simple things too. Sort and uniq are simple utilities that each carry out one operation: sorting, and extracting unique entries (among other things), respectively.
We’ll start simple, with sort and uniq.
Sort & Uniq
Sort is useful for putting items into order – it takes newline-separated text on input and outputs it sorted, in the order specified by its parameters. With the ‘-r’ parameter, sort performs a reverse-order sort.
This tool is particularly useful when used with uniq, which only compares adjacent lines and therefore needs sorted input to catch every duplicate.
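As a minimal sketch of sort on its own (the `fruits` function and its data are made up for illustration, not taken from the original post):

```shell
# Hypothetical sample data: one fruit name per line.
fruits() {
  printf '%s\n' pear apple cherry banana
}

fruits | sort      # ascending order: apple, banana, cherry, pear
fruits | sort -r   # reverse order: pear, cherry, banana, apple
```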
Uniq will remove duplicate entries from the input, display only duplicate entries (when the ‘-d’ parameter is used), or display a count of duplicates in the input (when used with ‘-c’). The latter function is quite useful for collecting data – I used it just the other day to examine my log file to block the IP addresses of some spammers. See the bottom of this post for more details.
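The three uniq modes described above might be tried out like this. The `visits` function and the names in it are invented sample data; only the `-d` and `-c` parameters come from the text.

```shell
# Hypothetical sample data containing repeated entries.
visits() {
  printf '%s\n' alice bob alice carol bob alice
}

visits | sort | uniq       # each name once: alice, bob, carol
visits | sort | uniq -d    # only the names that occur more than once
visits | sort | uniq -c    # each name prefixed with the number of occurrences
```

The sort step comes first in each pipeline because uniq only notices duplicates that sit on adjacent lines.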
Sed
Every Unix user’s friend, sed is great for performing replacements on input, among other things. It takes some input, applies a regex-based search-and-replace, and prints the result. It’s useful for removing unwanted characters, pruning input for further processing, and changing the order of fields (although awk can do this too).
For example, converting all files in the current directory from png to jpg:
This grabs all ‘png’ files in the current directory, and runs ‘convert’ with the original filename as the first argument, and the same filename with ‘png’ replaced with ‘jpg’, as the second argument.
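A loop along those lines could be sketched as follows. This is not necessarily the original command: the `jpg_name` helper is a hypothetical name, the pattern is anchored to the file extension rather than replacing the first ‘png’ anywhere in the name, and the loop only prints the ‘convert’ commands so that it is safe to try. Remove the `echo` to actually run ‘convert’ (from ImageMagick, assumed to be installed).

```shell
# Hypothetical helper: derive the .jpg filename from a .png filename using sed.
jpg_name() {
  echo "$1" | sed 's/\.png$/.jpg/'
}

for f in *.png; do
  [ -e "$f" ] || continue                  # skip the literal '*.png' when nothing matches
  echo convert "$f" "$(jpg_name "$f")"     # dry run: print the command instead of running it
done
```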
So, a common usage of sed is:
This replaces the ‘search’ pattern with ‘replace’ in the input from ‘source’.
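In its simplest form this might look like the sketch below. The `sample` function stands in for ‘source’ and its text is invented; sed reads standard input here, but a filename can be given as the last argument instead.

```shell
# Hypothetical input standing in for 'source'.
sample() {
  printf '%s\n' 'search for search terms' 'nothing here'
}

sample | sed 's/search/replace/'    # replaces the first match on each line
sample | sed 's/search/replace/g'   # the trailing 'g' replaces every match on each line
```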
‘Search’ is a regular expression that specifies the text to be transformed. It can include parentheses ‘(‘ and ‘)’ to designate captured text (written ‘\(‘ and ‘\)’ in sed’s default basic regular expression syntax), which can then be referenced in the replacement through backreferences, denoted by ‘\n‘, where n is the index of the captured expression. This is a trivial example, which could probably be done in many ways, but:
This looks for a run of characters other than ‘<’ between a ‘<b>’ string (b can be lower or uppercase) and a ‘</b>’ string. The whole match is replaced with <H2>, then the text that was between the two <b> tags, and </H2>.
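One way to write such an expression is sketched below, wrapped in a hypothetical `bold_to_h2` function so it can be reused. It uses sed’s basic regular expression syntax, where the capturing parentheses are escaped as `\(` and `\)`, and `\1` refers to the captured text.

```shell
# Hypothetical helper: turn <b>...</b> (either case) into <H2>...</H2>.
bold_to_h2() {
  sed 's/<[bB]>\([^<]*\)<\/[bB]>/<H2>\1<\/H2>/g'
}

echo 'Intro <b>Section one</b> and <B>Section two</B>' | bold_to_h2
```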
Tricky, but after some practice, sed is very useful.
Awk
Awk is a full-featured text processing tool built around a fairly complete programming language. It allows for variables, arithmetic operations, fairly complicated search patterns, text manipulation functions, and execution control statements – if, while, for, and so on.
There are a huge number of awk tutorials available if you want more information than this very quick intro provides.
So, an awk expression typically looks like:
…That is, one or more pattern-action pairs, where an empty pattern will match everything. A pattern can be ‘BEGIN’, which causes the associated actions to be executed before anything else (for variable initialisation, for example), ‘END’, which executes them after everything else (for reporting, perhaps), or a regular expression or other comparison expression.
Variables are used much as in C (‘var = “some value”; print var;’), although they need no declaration. Awk defines some special variables: NF gives the number of fields in the current record (that is, the number of items separated by the field separator on the current line of input); NR gives the current record number (line number); and FS holds the field separator, which can be set to any character or, in fact, any regular expression.
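To make the pattern-action structure concrete, here is a small sketch with one pair of each kind. The `count_errors` function, the variable names and the input lines are all invented for the example.

```shell
# Hypothetical helper: count input lines that mention "error".
count_errors() {
  awk '
    BEGIN   { errors = 0 }     # runs once, before any input is read
    /error/ { errors++ }       # runs for each line matching the regular expression
            { total++ }        # empty pattern: runs for every line
    END     { print errors " of " total " lines mention error" }
  '
}

printf '%s\n' 'ok' 'error: disk full' 'ok' 'error: timeout' | count_errors
```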
Note that the field separator can be set with the ‘-F’ parameter when calling awk.
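The special variables and the two ways of setting the field separator can be seen in this sketch. The function names and the colon-separated records are made up for illustration.

```shell
# -F sets the field separator on the command line.
describe_fields() {
  awk -F: '{ print NR ": " NF " fields, last is " $NF }'
}

# Setting FS in a BEGIN block has the same effect.
first_field() {
  awk 'BEGIN { FS = ":" } { print $1 }'
}

# Hypothetical colon-separated records.
printf '%s\n' 'root:x:0:0' 'guest:x:1000' | describe_fields
printf '%s\n' 'root:x:0:0' 'guest:x:1000' | first_field
```

Here `$NF` is the last field of the record, since NF holds the number of fields.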
Awk also defines some functions that are fairly similar to C’s: printf in particular, which formats a series of values according to a given format expression.
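A brief sketch of printf in use follows. The `format_report` function and its two-column input are hypothetical; the format string left-aligns the first field in six characters and prints the second with two decimal places.

```shell
# Hypothetical helper: line up a name and a number in fixed-width columns.
format_report() {
  awk '{ printf "%-6s|%6.2f\n", $1, $2 }'
}

printf '%s\n' 'cpu 93.456' 'disk 7' | format_report
```

Unlike print, printf does not add a newline by itself, hence the `\n` at the end of the format.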
One simple usage for awk is to extract (possibly format) fields from a file. The command:
…Extracts the IP address from an Apache log file. This can then be processed to extract statistics.
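A command of this kind could be sketched as below, assuming the common Apache log layout in which the client address is the first whitespace-separated field. The `extract_ips` name and the log lines are invented for the example, and the addresses come from ranges reserved for documentation.

```shell
# Hypothetical helper: print the first field (the client IP) of each log line.
# Reads the files given as arguments, or standard input if there are none.
extract_ips() {
  awk '{ print $1 }' "$@"
}

# Invented log lines, not taken from a real server.
printf '%s\n' \
  '192.0.2.10 - - [10/Oct/2006:13:55:36 +0000] "GET / HTTP/1.0" 200 2326' \
  '198.51.100.7 - - [10/Oct/2006:13:56:01 +0000] "POST /comment HTTP/1.0" 200 512' |
  extract_ips
```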
A more advanced example counts the number of times a certain IP address was logged:
This starts by setting an ‘ip’ variable, the IP to search for. Then, for every record (line) whose first field matches that ip variable, the variable ‘x’ is incremented. At the end, it prints a message containing the IP address and the number of times it was found in the log.
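A sketch of such a counter is shown here. It differs slightly from the description: rather than hard-coding the address in a BEGIN block, the hypothetical `count_ip` function passes it in with awk’s `-v` option so the same code works for any address. The message wording is also my own.

```shell
# Hypothetical helper: count log lines whose first field equals the given IP.
# Usage: count_ip ADDRESS < logfile
count_ip() {
  awk -v ip="$1" '
    $1 == ip { x++ }                                        # first field matches: count it
    END      { print ip " was logged " (x + 0) " times" }   # x + 0 prints 0 when nothing matched
  '
}

# Invented log lines using documentation-only addresses.
printf '%s\n' '192.0.2.10 - - a' '198.51.100.7 - - b' '192.0.2.10 - - c' |
  count_ip 192.0.2.10
```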
Putting it together
Here’s an example of using sort, uniq, sed and awk (this may not be the optimal solution, but it works). I was being comment-spammed quite severely (as usual), and wanted to try blocking the offenders. I had a look in my log to grab out the offending IP addresses:
This performs the following:
The output looked something like:
I then blocked the first few IP addresses (which, as a side note, didn’t help reduce spamming – there appears to be an almost infinite pool of addresses which are used).
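A pipeline combining all four tools for this kind of job might look like the sketch below. It is a reconstruction under assumptions, not the command I originally ran: it assumes comment submissions show up as POST requests, that the client address is the first field of each log line, and the `top_posters` name is hypothetical.

```shell
# Hypothetical helper: list client IPs by number of POST requests, busiest first.
# Reads the log files given as arguments, or standard input if there are none.
top_posters() {
  grep 'POST' "$@" |        # keep only POST requests (assumed to be comment submissions)
    awk '{ print $1 }' |    # first field: the client IP address
    sort |                  # group identical addresses together for uniq
    uniq -c |               # prefix each address with its count
    sort -rn |              # highest counts first
    sed 's/^ *//'           # strip the padding uniq adds before the count
}
```

Each output line is a count followed by an address, so the worst offenders appear at the top.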
So, that’s the rundown on a few useful tools. They can save a lot of time, especially when combined with a bit of shell script framework. Try them out sometime!