
One-way blogs are irritating; E-paper disposability

I’m annoyed by blogs that don’t allow people to post comments! One of the exciting elements of the blogosphere is the two-way nature of it, and the ability to create global discussions anywhere, on any topic, where anyone at all can comment. It’s exciting because it’s democratic and egalitarian, the way the Internet should be!

So, sites like Gizmodo that require an ‘invitation’ to get an account to post comments are quite annoying. Disallowing any rebuttal or discussion, except to those one deems worthy, seems very un-democratic!

Anyway, what put me on to this rant was this article on flexible LCD screens, which demands a comment after their statement about being one step closer to the “Holy Grail of electronic displays: crumpling them up and throwing them in the trash like basketballs”:

Who on earth wants to do that? How is the ability to throw away flexible screens, thereby creating more e-waste, a Holy Grail? Unless it’s just me, that seems to entirely defeat the purpose of reusable ‘E-paper’, apart from the obvious environmental issues of yet more non-biodegradable waste.

We live in such an irresponsible culture!


Incremental rsync backups

I’ve written a shell script to do incremental backups with rsync (inspired by this macosxhints article).

The script takes one or more sources, and a local destination path, and creates an incremental rsync mirror at the destination. On the first run it creates an initial archive; on subsequent runs it creates a new archive, hard-links the last archive into it, and performs an rsync over it. Each archive will appear to be a full mirror, but will actually only use the disk space of the files that have changed, due to the use of hard links.

Here’s the script:

incremental_rsync.sh

And some uses of it:

incremental_rsync.sh [email protected]:www /Volumes/BackupDrive/Web

incremental_rsync.sh Documents Pictures Library /Volumes/BackupDrive/Home

Note that old backups can be safely deleted without disturbing more recent backups; as hard links are used, the data won’t be deleted until all links that reference it are removed.
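By way of illustration, here’s a minimal sketch of how such a script might look. This is my own guess at the approach, not the actual incremental_rsync.sh, and it uses rsync’s --link-dest option to do the hard-linking, rather than linking the previous archive in first as described above:

#!/bin/sh
# Sketch only: incremental backups via rsync and hard links.
# Usage: incremental_rsync.sh source [source ...] destination

# Everything except the last argument is a source (POSIX sh has no arrays)
sources=""
while [ $# -gt 1 ]; do
   sources="$sources $1";
   shift;
done
dest="$1"

stamp=`date +%Y-%m-%d-%H%M%S`
new="$dest/backup-$stamp"
last=`ls -1d "$dest"/backup-* 2>/dev/null | tail -n 1`

if [ "$last" ]; then
   # Hard-link unchanged files against the previous archive
   rsync -a --delete --link-dest="$last" $sources "$new";
else
   # First run: create the initial archive
   mkdir -p "$new" && rsync -a --delete $sources "$new";
fi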


Are you leaking search queries?

Recently, AOL leaked 20 million search queries to the world (as covered in a NY Times article). It listed search queries alongside user numbers, which links search terms to individuals. Although no names are listed, it is often not difficult to determine (or at least narrow down) an individual’s identity, given their search history, as demonstrated in the above article.

The privacy ramifications here are extremely worrying. Web searching is something we all do, and for some of us, often reveals all kinds of intimate details about our lives. That this search history is recorded at all is, if I may say so, an abomination; but that this was carelessly leaked by AOL is very worrying. And yes, even Google record search queries (they’re just a little more careful with it).

The appeal to government and ‘law enforcement’ groups is obvious, which makes this even more dangerous. While Google resisted the Justice Department’s subpoena, the other search engines capitulated.

Particularly worrisome is the international nature of many of these search engines, with data centres located outside regions that are protected by various privacy laws. Only recently, Yahoo cooperated with the Chinese government in revealing the identity of a Chinese journalist who had distributed a government warning about reporting on sensitive local issues; the journalist is now serving a 10-year prison sentence.

Yahoo was forced by local laws to cooperate with the Chinese government. While one may assume that this only affects local users, it’s quite common for data to be mirrored at several sites, ensuring adequate redundancy should a site be compromised and its data lost. Thus, it’s quite possible that Australian and American search history is stored in regions where the local government has complete control over access to this data.

The Chinese incident is sinister enough, but this is probably just the beginning.

Consequently, the EFF (Electronic Frontier Foundation) have published an article with a few notes on how to maximise your privacy when using search engines.

Some points are:

  • Don’t put personally-identifying information in your searches
  • Don’t log in to a search engine account
  • Don’t accept cookies from your search engine

They are worth reading through; in particular, there are some directions for Firefox users on how to configure an extension to increase your anonymity with Google searches.

Even if you don’t follow those suggestions, I recommend regularly clearing out your Google cookies. These cookies provide Google and other search engines with a link between your search queries – an identifier that ties them together. By clearing these cookies out regularly, you sever the link between past queries and future queries. In particular, do this before embarking on a particularly sensitive search.

To remove cookies in Firefox:

  1. Bring up Firefox Preferences (on a Mac, click the ‘Firefox’ menu at the top left, then ‘Preferences’)
  2. Click the ‘Privacy’ icon
  3. Click the ‘Cookies’ tab
  4. Click ‘View Cookies’, bottom left
  5. Type ‘google’ in the search bar, and remove all related cookies by selecting them and clicking ‘Remove Cookies’

In Safari:

  1. Bring up Safari Preferences (‘Safari’ menu, ‘Preferences’)
  2. Click the ‘Security’ icon
  3. Click ‘Show Cookies’
  4. Scroll down to the Google cookies, select them, and press ‘Remove’

Instructions for Internet Explorer are here (If you are using Internet Explorer, it is recommended that you switch to Firefox, which offers increased security and stability. Honestly.).

Alternatively, just delete all cookies, which probably can’t hurt.


Unix tools tutorial

Last week we took a quick look at shell scripting. This week we’ll finish off with a look at some useful tools – awk, sed, sort and uniq.

These are used mainly for manipulating text, and gathering information. Awk and sed provide a fairly rich language set, but can be used for very simple things too. Sort and uniq are simple utilities that carry out one operation – sorting, and extracting unique entries (among other things), respectively.

We’ll start simple, with sort and uniq.

Sort & Uniq

Sort is useful for putting items into order – it takes newline-separated text data on input, and outputs the data, sorted, in the order specified as parameters. When used with the ‘-r’ parameter, sort will perform a reverse order sort.

This tool is particularly useful when used with uniq, which requires that the input is sorted.

Uniq will remove duplicate entries from the input, display only duplicate entries (when the ‘-d’ parameter is used), or display a count of duplicates in the input (when used with ‘-c’). The latter function is quite useful for collecting data – I used it just the other day to examine my log file to block the IP addresses of some spammers. See the bottom of this post for more details.
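For example, counting how many times each line appears in a (hypothetical) words.txt, with the most frequent first:

$ sort words.txt | uniq -c | sort -nr
  12 apple
   5 banana
   2 cherry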

Sed

Every Unix user’s friend, sed is great for performing replacements on input, among other things. Sed’s great for taking some input, applying a regex-based search-replace, and printing the result. It’s useful for removing unwanted characters from input, pruning input for further processing, and changing the order of fields (although awk can do this too).

For example, converting all files in the current directory from png to jpg:

$ find . -maxdepth 1 -type f -iname '*.png' -print0 | xargs -0 -i{} sh -c 'convert "{}" "`echo {} | sed s/png/jpg/`"'

This grabs all ‘png’ files in the current directory, and runs ‘convert’ with the original filename as the first argument, and the same filename with ‘png’ replaced with ‘jpg’, as the second argument.

So, a common usage of sed is:


$ source | sed s/search/replace/ | destination

This replaces the ‘search’ pattern with ‘replace’ in the input from ‘source’.

‘Search’ is a regular expression that specifies the text to have the transformation done to it. It can include escaped parentheses ‘\(’ and ‘\)’ to designate captured text, which can then be referenced in the replacement through backreferences, denoted by ‘\n’, where n is the index of the captured expression. Here’s a trivial example, which could probably be done in many ways:

$ cat index.html | sed 's/<[bB]>\([^<]*\)<\/[bB]>/<H2>\1<\/H2>/'

This looks for a string of characters containing no ‘<’, between a ‘<b>’ tag (b can be lower or upper case) and a ‘</b>’ tag. The whole match is replaced with <H2>, then the captured text that was between the two tags, then </H2>.
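For example, a line containing <b>Introduction</b> would come out as <H2>Introduction</H2>.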

Tricky, but after some practice, sed is very useful.

Awk

Awk is a very full featured processing suite, which includes a fairly complex programming language. It allows for variables, arithmetic operations, fairly complicated search patterns, text manipulation functions, execution control statements – if, while, for, and so on.

There are a huge number of awk tutorials, for more information than this very quick intro.

So, an awk expression typically looks like:

PATTERN { Action; Action; Action } PATTERN2 { Action; action }

…That is, one or more pattern-action pairs, where an empty pattern will match everything. Patterns can be ‘BEGIN’, which causes the associated actions to be executed before anything else (for variable initialisation, for example), ‘END’, which executes after everything else (for reporting, perhaps), or something like a regular expression, or other comparison expression.

Variables are treated much like C variables (‘var = “some value”; print var;’). There are some special variables defined in awk: NF, which gives the number of fields in the current record (that is, the number of items separated by the field separator on the current line of input); NR, which gives the current record number (line number); and FS, the field separator, which can be set to any character or, in fact, any regular expression.
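As a quick illustration of a pattern-action pair and the NR variable (app.log is just a hypothetical file), this prints every line containing ‘error’, prefixed with its line number, then reports how many lines were read in total:

$ awk '/error/ { print NR": "$0 } END { print NR" lines processed" }' app.log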

Note that the field separator can be set with the ‘-F’ parameter when calling awk.

Awk also defines some fairly similar functions to C: printf in particular, which formats a series of variables according to a given format expression.
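For example, using ‘:’ as the field separator to pull the account name and numeric user ID out of /etc/passwd, formatted with printf (the exact output will of course vary between systems):

$ awk -F: '{ printf "%-12s uid=%s\n", $1, $3 }' /etc/passwd
root         uid=0
daemon       uid=1
...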

One simple usage for awk is to extract (possibly format) fields from a file. The command:

$ awk '{print $1}' /var/log/httpd/access_log

…Extracts the IP address from an Apache log file. This can then be processed to extract statistics.

A more advanced example counts the number of times a certain IP address was logged:

$ awk 'BEGIN { ip = "127.0.0.1" } $1 == ip {x++} END { print ip " was logged " x " times" }' /var/log/httpd/access_log

127.0.0.1 was logged 7 times

This starts by setting an ‘ip’ variable, the IP to search for. Then, for every record (line) whose first field matches that ip variable, the variable ‘x’ is incremented. At the end, it prints a message containing the IP address and the number of times it was found in the log.

Putting it together

Here’s an example of using sort, uniq, sed and awk (this may not be the optimal solution, but it works). I was being comment-spammed quite severely (as usual), and wanted to try blocking the offenders. I had a look in my log to grab out the offending IP addresses:

$ grep comment.php public_html/mike/db/hits | tail -n 1000 | awk '{print $3}' | sed s/:// | sort -n | uniq -c | sort -nr | less

This performs the following:

  1. Searches public_html/mike/db/hits for the term ‘comment.php’ (which the spammers were attacking)
  2. Grabs the last 1000 lines
  3. Extracts the third field, which was the IP address
  4. Removes the colon character, which was in the field
  5. Performs a numeric sort on the IP addresses
  6. Counts the number of repetitions of addresses
  7. Sorts in reverse numeric order, so that the largest number of repetitions (repeat offenses) are displayed first
  8. Displays the results in the ‘less’ viewer

The output looked something like:

37 200.88.223.98
20 202.83.174.66
13 213.203.234.141
12 62.221.222.5
11 222.236.34.90
10 192.104.37.15
10 165.229.205.91
 9 213.148.158.191
 9 193.227.17.30
 8 61.238.244.86
...

I then blocked the first few IP addresses (which, as a side note, didn’t help reduce spamming – there appears to be an almost infinite pool of addresses which are used).

So, that’s the rundown on a few useful tools. They can save a lot of time, especially when combined with a bit of shell script framework. Try them out sometime!


A brief shell scripting tutorial

Shell scripts are very useful things, whether they’re prepared and saved to a file for regular execution (for example, with a scheduler like cron), or just entered straight into the command line. They can perform tasks in seconds that may take days of repetitive work, like, say, resizing or touching up images, replacing text in a large number of HTML documents, converting items from one format to another, or gathering statistics.

This entry is a brief introduction to scripting using the Bash shell, which I find to be the most intuitive, and probably the most common (that said, most of the information here will apply to other shells). We will explore some of the basic building blocks of scripting, such as while and for loops, if statements, and a few common techniques for accomplishing tasks. Along the way, we’ll also take a look at some of the tools that make shell scripting a bit more useful, such as test, and bc. Finally, we’ll put it together with a couple of examples.

Stay tuned over the next few days for a brief tutorial on some very useful unix tools, like awk, sed, sort and uniq, which may become indispensable to you (they have for me).

But first, let’s begin with some basics.

Scripting 101

Bash (and its siblings) is a lot more than just a launcher; it is more like a programming language interface, which allows users to enter very complicated commands to achieve a wide variety of tasks. The language used is much like any other programming language – it contains for and while loops, if statements, functions and variables.

Commands can be either entered straight into the command line, or saved to files for execution.

Script files invariably begin with a line that’s commonly known as the shebang (hash-bang, referring to the first two characters):

#!/bin/sh

This is a directive that’s read by the shell when a script file is executed – it tells the shell what to use to interpret the commands within. In this case, the /bin/sh application will be used. This is the most common shebang; it can be replaced with #!/usr/bin/perl for Perl scripts, or #!/usr/bin/php for PHP scripts, too.

After the shbang comes the script itself – a series of commands, which will be executed by the interpreter. Comments can be entered in to make the script more readable; these are prefaced by a hash symbol:

# This is a comment

When creating a new script file, I find it easiest to set it as executable, so it can be run by just entering the script name. Otherwise, the script has to be run as a parameter to the interpreter (such as ‘sh script.sh’). This is annoying, so make the file executable with:

chmod +x script.sh
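After that, either of these will run the script (the second only once the executable bit is set):

$ sh script.sh
$ ./script.sh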

Now that the boring stuff’s covered, let’s move on to the basic code structures!

Holding and manipulating values



Variables are used to hold values for use later, and are accessed by a dollar symbol, followed by the variable name. When defining the values of variables, the dollar sign is not used at all. For example:

count=2;

echo $count;

Note particularly the absence of spaces around the equals sign in the first line above. This is required – putting spaces in (like ‘count = 2’) will cause a syntax error.

Numeric variables can have arithmetic operations performed on them using the $((…)) syntax. This allows for simple integer addition, subtraction, division and multiplication. Operations can be combined, and brackets can be used to form complex expressions. For example:

count=$(($count+1));

product=$((count*8));

complex=$(((product+2)*($count-4)));

Note the first line of the previous example – the simple increment. This is quite useful for performing loops with a counter (we’ll have a look at loops soon).

For performing more complicated arithmetic, the ‘bc‘ tool is quite handy. bc is an arbitrary precision calculator language interpreter, and provides basically any mathematical function that could possibly be required.

To use bc, simply ‘pipe’ commands into it, and grab the result on bc’s stdout:

$ echo '8/3' | bc -l

2.66666666666666666666

$ echo 'a(1)*4' | bc -l

3.14159265358979323844

$ pi=`echo 'a(1)*4' | bc -l`

$ echo $pi

3.14159265358979323844

Note the ‘-l’ parameter to bc – this defines the standard mathlib library, which contains some useful functions (like arctan, or ‘a’, used above). The parameter also makes bc use floating-point numbers by default (without it, bc will only give integer results).

Command-line parameters



Often you will want your shell scripts to take parameters, which modify the behaviour of the script. They can specify a file on which to operate, or a number of times to iterate over a loop, for example. This essentially just passes a variable into the script, which can then be used.

Command line arguments appear as numbered variables. $0 denotes the command that was run (your script’s name, typically). After that, the arguments to the command are given, as $1, $2, $3, onwards.

For example, the script:

#!/bin/sh

echo $0 utility.

echo Arguments are:

echo $1 – first argument

echo $2 – second argument

echo $3 – third argument

Can be executed with:

$ ./test.sh a b c

./test.sh utility.

Arguments are:

a – first argument

b – second argument

c – third argument

Arguments can also be referred to en masse with the $* special variable, which returns a string containing all arguments.

See ‘Iterating over command-line arguments’ for notes on how to use this.
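As a quick illustration (args.sh is just a hypothetical name for the script below), this also uses the related $# variable, which holds the number of arguments:

#!/bin/sh
# args.sh: print the argument count, then all arguments
echo "Received $# arguments: $*";

Running ./args.sh one two three prints ‘Received 3 arguments: one two three’.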

Making decisions

Making decisions in a script is a very useful thing to be able to do – it can allow you to take actions depending on whether a command succeeded or failed, or it can allow you to perform an action only if it’s applicable (like only backing up to an external drive if it’s plugged in!).

If statements are formatted thus:

if test; then

   do_something;

elif test2; then

   do_something_else;

else

   do_something_completely_different;

fi;

The ‘elif‘ statement is optional, and can be omitted. It can also be duplicated – like any if statement in any other language, you can have as many elif’s as you like.

Note that this can also go on one line. For example: if test; then do_something; else do_something_else; fi

The statement above performs test; if test succeeded, then do_something will be executed. Otherwise, test2 is performed. If that succeeds, do_something_else is executed. Otherwise, do_something_completely_different is executed.

The test in an if statement is a command that is executed; the value returned from the command is used to make the decision.

All command-line applications return a numerical value (this can be any integer value), which is usually utilised to indicate the status of the command upon exiting. A value of zero is usually used to indicate success. A non-zero value usually indicates failure.

You can observe the value returned by a command by using the $? variable immediately after the command exits:

$ ping nosuchhost

ping: cannot resolve nosuchhost: Unknown host

$ echo $?

68

$ ping -c 1 google.com

PING google.com (72.14.207.99): 56 data bytes

64 bytes from 72.14.207.99: icmp_seq=0 ttl=233 time=258.205 ms

--- google.com ping statistics ---

1 packets transmitted, 1 packets received, 0% packet loss

round-trip min/avg/max/stddev = 258.205/258.205/258.205/0.000 ms

$ echo $?

0

The if construct tests whether the returned value is zero or nonzero. If it’s zero, the test passes. So, we could write:

if ping -c 1 google.com; then

   echo 'Google is alive. All is well.';

else

   echo 'Google down! The world is probably about to end.';

fi;

Tests can also be performed in-line, allowing commands to be strung together. The && joiner, placed between commands, tells the shell to execute the right-hand command only if the left-hand command succeeds:

ifconfig ppp0 && echo 'PPP Connection Alive.'

The || joiner performs similarly, but will only execute the right-hand command if the left-hand command fails:

ifconfig ppp0 || redial_connection

Commands can be grouped in these structures, and strung together – for example, a series of commands that must be executed in sequence, and only if all preceding commands succeed too. Commands can be grouped in brackets, to form fairly complex statements:

$ true && (echo 1 && echo 2 && (true || echo 3) && echo 4) || echo 5

1

2

4

$ false && (echo 1 && echo 2 && (true || echo 3) && echo 4) || echo 5

5

$ true && (echo 1 && echo 2 && (false || echo 3) && echo 4) || echo 5

1

2

3

4

Testing, testing

Now that we’ve seen how to act upon the results of a test, it’s a good time to introduce the test utility itself.

test is an application that is used to perform a wide variety of tests on strings, numbers, and files. Expressions can be combined to construct fairly complex tests. By way of example, let’s look at a few uses of test:

$ test 'Hello' = 'Hello' && echo 'Yes' || echo 'No'

Yes

$ test 'Hello' = 'Goodbye' && echo 'Yes' || echo 'No'

No

$ test 2 -eq 2 && echo 'Yes' || echo 'No'

Yes

$ test 2 -lt 20 && echo 'Yes' || echo 'No'

Yes

$ test 2 -gt 20 && echo 'Yes' || echo 'No'

No

$ test -e /etc/passwd && echo 'Yes' || echo 'No'

Yes

See the test manual page for more information.

To perform arithmetic tests on floating point values, the bc tool steps in again (as ‘test’ will only operate on integers):

$ test `echo '3.4 > 3.1' | bc` -eq 1 && echo Yes || echo No

Yes

$ test `echo '3.4 > 3.6' | bc` -eq 1 && echo Yes || echo No

No

Note particularly the single quotes around the ‘>’ expression: without this, the meaning of the expression changes (the value ‘3.4’ will be redirected into the file ‘3.1’ or ‘3.6’).

If a bc expression evaluates to true, bc returns ‘1’. Otherwise, bc returns ‘0’.

For code readability, the test utility also goes by the name ‘[’; in that form it expects a matching ‘]’ as its last argument. Thus, test can be used in commands like:

if [ $count -gt 4 ]; then

   take_action;

fi;

Gone loopy

Iterating over commands can be great for performing tasks on a large number of items. There are two loop types defined, while and for loops.

While loops

While loops have the following structure:

while test; do

   command_1;

   command_2;

done;

Here, test is the same as that from if statements (see above). Note the placement of semicolons – after the test, and before the do, in particular.

Note that, like all script elements, while loops can be used on one line, for quick entry on the command line: while test; do command_1; command_2; done;

While loops, like their counterparts in other programming languages, will continue executing until test evaluates to false.
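For instance, a simple counting loop that prints five lines and then stops:

count=1;
while [ $count -le 5 ]; do
   echo "Iteration $count";
   count=$(($count+1));
done;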

For loops

For loops are defined thus:

for var in set; do

   command_1;

   command_2;

done;

For loops are used to iterate over a set of values, defined here in set. The variable var is used to iterate over the set: For each iteration, var is set to the next value within set.

Set is a whitespace-delimited string, containing a list of items. For example:

for dir in Documents Pictures Library Music; do

   cp -a $dir /backup;

done;

Or

for image in Images/*.jpg; do

   convert "$image" -scale 640x480 "$image-scaled.jpg";

done;

Escaping

To break out of a while or for loop, the ‘break’ command is used. To continue onto the next iteration, thereby skipping the rest of the statements in the loop body, the ‘continue’ command is used. For example:

count=0;

while [ $count -lt 100 ]; do   # Iterate 100 times

   if [ $count -eq 2 ]; then   # Skip the 2nd iteration

      count=$((count+1));      # (still increment, or the loop would never end)

      continue;

   fi;

   do_stuff || break;          # Stop iterating if do_stuff fails

   count=$((count+1));         # Increment 'count'

done;

Iterating over files

Let’s direct our attention to that second-last example:

for image in Images/*.jpg; do

   convert "$image" -scale 640x480 "$image-scaled.jpg";

done;

Note that this will only function correctly if none of the files in ‘Images’ have spaces in their name. As this is a rather dangerous assumption, we best avoid it when we can.

To be honest, I haven’t discovered a way to make this work on files with spaces. Instead, I tend to use the ‘find‘ tool with ‘xargs‘ to perform commands.

The ‘find’ tool will return a list of files that match the provided pattern. The ‘xargs’ utility performs a set of commands on each item it receives as input. We can put the two together with:

find -maxdepth 1 -type f -print0 | xargs -0 -i{} sh -c 'echo Working on file {}.'

This example finds all files (-type f) in the current directory (-maxdepth 1), and then xargs prints ‘Working on file <filename>.’ for each one.

The -print0 argument to find forces the utility to delimit files with a ‘null’ character instead of the default, newline. This makes for safer filename handling. It has to be used with the -0 argument in xargs, which will use the null character as the delimiter in the input.

The -i{} parameter tells xargs to use the ‘{}’ sequence to denote where the filename should be placed in the command. The arguments that follow are the command to execute. The argument sh -c ‘echo Working on file {}.’ here will make the shell execute the echo command.

Note that the echo command could be used without ‘sh’, like: xargs -0 -i{} echo Working on file {}.

This is fine if only one command is used. However, if more than one command is to be executed, or more complex commands are to be used, these commands need to be interpreted with ‘sh’. As xargs is just a simple execution tool, it doesn’t understand shell scripts.

Thus, complex statements can be put together. For example (note that this is one command spread across two lines):

find -maxdepth 1 -type f -print0 | xargs -0 -i{} sh -c 'echo Working on file {}.; copy_file_to_server {} || echo Upload of {} failed.'

Iterating over command-line arguments

Often, you will want to make shell scripts take a series of arguments that are then iterated over. For example, a script may take a list of images to manipulate, or text files to edit.

Such a utility would be invoked with:

$ my_script.sh *.jpg

If any of the arguments had spaces in them (in this case, for example, a jpg called ‘My Trip.jpg’), this can be a little tricky to handle.

Although the arguments would be passed correctly (that is, one of the arguments would indeed contain the text ‘My Trip.jpg’), it is difficult to iterate over them correctly. If a for loop were to be used:

for img in $*; do

   manipulate_image $img;

done;

…Spaces within filenames would cause problems. In our example, instead of ‘My Trip.jpg’ being passed to manipulate_image, it would be split – first ‘My’ would be passed to manipulate_image, followed by ‘Trip.jpg’! Nasty.

A technique I often use is to make use of the shift command, which discards the first argument, and moves all other arguments down one. This is a more robust technique:

while [ "$1" ]; do

   manipulate_image "$1";

   shift;

done;

This will take the first argument, act upon it, then move the next argument down for the next loop.

The loop will finish when there are no more arguments, and “$1” will return an empty string, which evaluates to ‘false’.

Final words

That’s about it for this brief tutorial. Hopefully you have enough to start assembling scripts and powerful commands to help you out. There’s a huge amount more to know about shell scripting though – arrays, clever variable manipulation, and plenty more stuff that I’m entirely unaware of, I’m sure. If you want to know more, just do some Googling for shell scripting – there’s an insanely large number of resources out there.

Stay tuned over the next couple of days – I’ll post a brief guide to using some fairly nice tools, like awk, sed, uniq and sort. These little rippers are fantastic for manipulating text and gathering statistics. Trust me, once you know how to use them, you’ll use them all the time (I do!).

For now, I’ll leave you with a final example – this is a small script I wrote the other day to replace the ‘rm’ command, and move all ‘deleted’ items to the trash, instead of just deleting them outright. Here it is:

#!/bin/sh

if [ "$1" = '-rf' -o "$1" = '-r' ]; then

   shift;

   recursive=true;

fi;

while [ "$1" ] ; do

   # If not recursive, skip directories

   if [ -d "$1" -a ! "$recursive" ]; then

      echo $0: $1: is a directory; shift; continue;

   fi;

   [ ! -d ~/.Trash/"`pwd`" ] && mkdir -p ~/.Trash/"`pwd`";

   mv "$1" ~/.Trash/"`pwd`";

   shift;

done;


Idiot-proof ‘rm’

After my unfortunate accident, I decided to spend a moment trying to make sure it doesn’t happen again.

Thus, the (semi) idiot-proof ‘rm’ was born:

This is a small script that replaces ‘rm’. Instead of deleting, it moves the files in question to ~/.Trash.

It’ll create a directory structure under ~/.Trash that echoes that of the current working directory, so it’s easy to see where files came from.

Here’s the script:

#!/bin/sh

if [ "$1" = '-rf' -o "$1" = '-r' ]; then

   shift;

   recursive=true;

fi;

while [ "$1" ] ; do

   # If not recursive, skip directories

   if [ -d "$1" -a ! "$recursive" ]; then

      echo $0: $1: is a directory; shift; continue;

   fi;

   [ ! -d ~/.Trash/"`pwd`" ] && mkdir -p ~/.Trash/"`pwd`";

   mv "$1" ~/.Trash/"`pwd`";

   shift;

done;

trash.sh

To use it, save it into a file (I use ~/.trash.sh), and set it to executable. Then, open up ~/.bash_profile, and add the line:

alias rm='~/.trash.sh'



That will automatically be run when you log in. To make it work in the current login, just use:

source ~/.bash_profile



…Or type the alias command straight in.


URLs, the last great ‘what the’

Happily surfing the Microsoft website today (don’t ask), I was amused to note the ridiculous URLs in use – Here are some examples:

  • http://www.microsoft.com/downloads/details.aspx?FamilyId=4C254E3F-79D5-4012-8793-D2D180A42DFA&displaylang=en
  • http://www.microsoft.com/downloads/Browse.aspx?displaylang=en&productID=4289AE77-4CBA-4A75-86F3-9FF96F68E491
  • http://www.microsoft.com/downloads/info.aspx?na=63&p=&SrcDisplayLang=en&SrcCategoryId=&SrcFamilyId=9996B314-0364-4623-9EDE-0B5FBB133652&u=%2fgenuine%2fdownloads%2fWhyValidate.aspx%3ffamilyid%3d9996B314-0364-4623-9EDE-0B5FBB133652%26displaylang%3den

Whatever happened to friendly URLs? If I were going to point someone to an article or a download from Microsoft’s website, I’d need several weeks just to recite it! (I will admit that Apple is no better – take their link to the MacBook Pro on their store site: http://store.apple.com/133-622/WebObjects/australiastore.woa/80505/wo/3s3D4l85iljb2mPXrPH2pCTDMcy/0.SLID?nclm=MacBookPro&mco=7C576790)

Crazy long URLs force site users to work through the navigation instead of being able to point each other to pages: Imagine reciting such a URL over the phone – it would never happen – "slash, 3, lowercase s, 3, capital D, 4, lowercase l, 8, 5…". Instead, one would tend to point to apple.com, and give directions from there. There is also no way anyone could work out what each URL points to by looking at it. It’s crazy!

In the case of the four URLs above, these really should be something like:

  • http://www.microsoft.com/downloads/Worldwide_English/ActiveSync_4.1
  • http://www.microsoft.com/office
  • http://www.microsoft.com/Windows_Genuine_Advantage
  • http://store.apple.com/au/MacBookPro

These days, with ‘404 handlers’ and such things in common use (this site uses one!), it really is very easy to make user-friendly URLs. Having a decent URL for a site’s users means they’re more likely to be able to point each other to a site (Keep It Simple, Stupid), and thus more likely to bring more visitors to the site.

URL handlers are very easy to write – using Apache, one just needs a .htaccess file sitting in the webroot, which directs all URLs to a handler page:


RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /index.php

Then, have a page (in this case, index.php) which processes the URL and provides an appropriate page. SiteComponents has:

// Get URL request
$s = substr($_SERVER["REQUEST_URI"],
strrpos($_SERVER["SCRIPT_NAME"], "/")+1);
// Strip off GET parameters and anchors
if ( strpos($s, "?") !== false )
$s = substr($s, 0, strpos($s, "?"));
if ( strpos($s, "#") !== false )
$s = substr($s, 0, strpos($s, "#"));
// Run site
$site->Run(urldecode($s));

The ‘Run’ method within $site will then handle the URL, and return an appropriate page (or suggest one if no exact match is found).


Fractal image coding

I’m studying for a Digital Video Coding and Compression exam, and just wanted to share the coolness that is fractal coding.

[Image: Sierpinski gasket]

Fractals are geometric objects which are said to have infinite complexity, and are often self-similar (that is, a small portion of the fractal is the same as the whole fractal). Iterated Function Systems (IFS) are one kind of fractal. An example is the Sierpinski gasket. Zoom into any of the sub-triangles, and they look just like the original.

It’s formed by making three ‘copies’ of an original image (it doesn’t matter what the image is), and placing them in the shape of a triangle: one above, one bottom-left, one bottom-right. Repeat an infinite number of times.

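A little more formally (my own sketch of the standard formulation, not from the original post): the gasket is what you get by repeatedly applying three maps, each of which shrinks the picture to half size and moves it to one corner of a triangle. With corners at (0,0), (1,0) and (1/2, √3/2), the maps are:

w1(x, y) = (x/2, y/2)
w2(x, y) = (x/2 + 1/2, y/2)
w3(x, y) = (x/2 + 1/4, y/2 + √3/4)

No matter what image you start with, applying all three maps and repeating on the combined result converges to the same gasket.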

The same process, with an altered replication pattern, can be applied to form other shapes, such as a fern leaf (which uses 4 copies of the previous iteration’s image: One thin one for the stem, two rotated shrunk ones for the fronds coming off the sides, and one for the rest of the fern). Such patterns are often observed in nature, which makes fractals particularly interesting.

Fractal image coding applies a similar concept. Parts of the image are used to build up other parts (imagine an image of a tree; A bit of a branch is copied and shrunk, to form the image of a smaller branch coming off the first one). This process of searching for similar parts within the image, and using them to form the current part, is repeated over the whole image.

To reconstruct the image, the same process as generating the Sierpinski gasket is used: Bits of the starting pattern (which, amazingly, can be anything) are copied, scaled, coloured and rotated, as instructed by the encoding process. The whole process is repeated until the image is built.

Oh, it’s such black magic.

Unfortunately, the whole process is incredibly slow, because so much searching is required during the encoding process (Looking for that branch part that looks like the current part of the image being encoded). It also doesn’t quite give enough performance to make the extra work worthwhile. Still, it definitely performs well in some circumstances; maybe it will make a comeback when we have faster computers.

NB. Most of the images are nicked shamelessly from Dr. Tim Ferguson’s lecture notes on fractal coding, from CSE5302. Also, for further reading, Mirsad suggested this paper: A Review of the Fractal Image Coding Literature, by Brendt Wohlberg and Gerhard de Jager.


Hi! I'm Michael Tyson, and I run A Tasty Pixel from our home in the hills of Melbourne, Australia. I occasionally write on a variety of technology and software development topics. I've also spent 3.5-years travelling around Europe in a motorhome.

I make Loopy, the live-looper for iOS, Audiobus, the app-to-app audio platform, and Samplebot, a sampler and sequencer app for iOS.

Follow me on Twitter.
