If you have spent any significant amount of time working in or around a terminal, you have likely come across some of the “Basic Unix Tools”. To the scriptkids of today, these executable programs appear esoteric and outdated. They are but tools of a bygone era in which people were forced to do things in odd ways. If the folks who developed these programs had had access to fully-fledged IDEs or some of the rather wonderful applications which are central to the modern developer’s workflow, surely they would have had no need for these old school tools. Esoteric? Perhaps. They are not exactly “new”, certainly. Outdated or unnecessary? Far from it. The wizards who constructed these utilities and mastered their usage thought about computation and programming in a way entirely different from the perspective we hold towards computers today. Learning about these tools gives you a glimpse of the processes that ran in their brains in the prior century, and it just might help you become better at doing interesting things.
In this post I will attempt to introduce interested readers to Awk, one of the many command-line programs which make up the Unix Command Line Utilities.
Enter Awk
Awk is an interpreted programming language, developed in 1977 by Alfred Aho, Brian Kernighan and Peter Weinberger whilst at AT&T Bell Laboratories. It was named by combining the creators’ last names, and it happens to share a name with the rather cute and (in the great auk’s case) awfully dead ‘Auk’ bird. It was developed, as most good things are, to solve some pesky problems that plagued its authors. In their case, in ’77, they wanted a tool able to wrangle data and the outputs of existing Unix utilities in useful ways. It is a sort of programmable glue for sticking together text in meaningful ways. The original awk was rather bare-bones, relegated to one-liner programs. It evolved into a proper programming language a few years after its original development, out of organic interest in the tool. Awk became the language of choice for many folks at Bell Laboratories, given its simplicity, readability and composability with respect to other Unix utilities – these traits led folks to avoid more capable tools/languages in favor of pushing the less capable Awk beyond what it was originally scoped to do. Given this interest in the language, the team packed more features, like user-defined functions, into the tool in 1985, making Awk a fully fledged programming language that still sees use in 2023.
The language can be thought of as a “generalization” of grep and sed, built to better handle structured data and to do interesting things with the help of variables, arrays and complex functions through reusable scripting. Awk is the embodiment of the ideas behind grep and sed – regular expressions are a fundamental aspect of the language – but you have more control over how data should be handled or read in. You also have the flexibility to write logic at the command line or in full files, unifying one-off tasks and complex scripts within the same syntax. When you are using something like sed for “a quick little thing” that evolves into something a bit too complex to easily complete, you need to swap your tool for a heavier one; awk is a tool suited for both the light and the heavy wrangling tasks you’ll have (even if it isn’t always the most performant).
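To make that overlap concrete, here is a rough sketch (the file data.txt is a stand-in, not one of the examples from this post) showing a grep-style search and a sed-style substitution next to awk one-liners that do the same job:
# print every line containing a pattern
grep 'Europe' data.txt
awk '/Europe/' data.txt

# substitute text on every line
sed 's/Europe/EU/g' data.txt
awk '{ gsub(/Europe/, "EU"); print }' data.txt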
In some sense, it kind of feels like python before python was invented, as it is a cross-platform scripting language that looks a fair bit like pseudocode. The difference between awk and pseudocode, though, is that awk programs are executable! Awk was developed with the intention of being “useful” from the start. This intention comes through in the way one uses the language – it feels like an honest, good-faith, best-effort programming language, and it succeeds in its endeavor to be “useful”.
What can you Awk?
The background behind the language is fascinating, but what can we actually do with Awk? In short, a fair bit, but we need to understand how the tool works before we go around attempting to hammer screws in. In Awk we run programs at the command line with the awk command. One simple program might look as follows:
awk '$1 >= 8 { print $1 }' employee_pay_rates.txt
Where employee_pay_rates.txt is a text file with data like:
12.00 Marissa
1.00 Kevin
8.50 Pascal
We run the specified program (the string inside of ' ' quotes) on some input (the employee_pay_rates.txt file here) and then receive the following results:
12.00
8.50
Translating this snippet into English: we find all lines in the employee_pay_rates.txt file where the pay rate (here field 1, $1) is greater than or equal to 8.00 – on the lines where this condition holds true, we print the pay rate for that line.
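One aside worth knowing that the snippet above does not show: the action block is optional, and when it is omitted awk falls back to its default action of printing the whole matching line. So a slight variation on the same filter,
awk '$1 >= 8' employee_pay_rates.txt
would print the full matching lines instead:
12.00 Marissa
8.50 Pascal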
One modification we could make to the program to coax more interesting outputs would be to have it print the Employee’s Name in addition to their pay rate:
awk '$1 >= 8 { print $2 " makes " $1 " per hour" }' employee_pay_rates.txt
which modifies our output, returning:
Marissa makes 12.00 per hour
Pascal makes 8.50 per hour
This program, whilst trivial, displays the core loop underpinning all Awk programs. We look at some “input” and check for “pattern” matches, then we take “actions” based on the patterns we encounter. This Inputs-->Patterns-->Actions loop holds for effectively all Awk programs. This standard way of handling, running and thinking about Awk programs is one of its delightful simplicities. We can hold this execution loop as constant in our minds whenever we work with awk and focus our attention on the logic we want to implement.
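Stripped to its barest form, that loop is spelled as a series of pattern { action } pairs; the following is a loose sketch of the shape (not a program from this post):
# for every input record, each pattern is tested and its action runs on a match
/error/  { print "this record mentions an error" }
NF > 3   { print "this record has more than three fields" }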
Personally, I think that understanding the general layout of a system is the first step towards making changes or doing interesting things with that system. If you have no bearings, where should you go?
In the context of Awk, we can assume that programs will generally look like the following:
# Section to execute before seeing any inputs
BEGIN {
# Define our Field Separator (FS) and Record Separator (RS)
FS = "\t"
RS = "\n"
}
{
# Print out some formatted information
printf("\n%s%s %s has field 1 set to- '%s' \n",
FILENAME, "--Record NO.", FNR, $1)
printf(" & it has %s fields\n", NF)
continents[$4] += 1
}
# We can use regex directly as patterns
/North/ { print " this line has the substring 'North' on it" }
# Multiple Patterns can be matched for any given input line
/Asia/ || /Europe/ { print " this country is from the old world "} #one of the two patterns must be matched
$3 >= 250 && $2 >= 200 { print " this is one large country..."} #both patterns must be matched
# Section that awk executes after iterating over all of the inputs
END {
print "this data file has: ", length(continents), "unique continents"
# Iterate over an associative array and print some values
for (c in continents){
print c, continents[c]
}
# Unix ~= Home of the Hacker; in Awk we can easily invoke other commands from inside a script
# (system() runs the command and returns its exit status; the command's output goes straight to stdout)
system("ls -l")
}
Some additional notes might aid in understanding this simple program:
- The notion of “Input” is fundamental to awk, as the language was developed in and intended to be used for the Unix-y piping of stdouts.
  - It is fair to say that most Awk programs revolve around a central loop over the provided “Input”, so every program one constructs with Awk has a “for” loop at its core.
- Here “Input” refers to the source of data we supply to an Awk program; in a call to awk like awk -f prog logs.txt, our input is logs.txt.
  - We loop over all of the “records” (substrings, split apart by the variable “RS”) in the file, and for each record Awk parses out its “fields” (substrings of the record, split apart via the character defined in “FS”).
- Awk has two “special” sections in a program’s execution, the BEGIN and END blocks.
  - BEGIN blocks contain code that is to be executed before iterating over the inputs.
  - END blocks and their code are executed after completing the iteration over all inputs.
  - One can include multiple “BEGIN” or “END” code sections in a program, all of which get concatenated into a single block at runtime, in the order they are defined.
- “Fields” are those odd numbered things; field 1 in a record is referred to in an Awk program via $1, field 2 is $2 and field N is $N (see the short sketch after this list).
  - Awk has a max limit on the number of fields a “record” can have, which may vary by awk distribution and machine.
- The language comes with a plethora of built-in variables, FILENAME, FS, etc., which one can assume are always available for use.
- We can “define” variables rather magically, by conjuring them into existence simply by “using” them.
  - This is demonstrated with the var continents in the above program; simply using it as an array in Awk is sufficient to declare it, with no need for init’ing variables as we customarily do in other languages.
  - The type of a variable is determined by the way in which you use it; in other words, the type is determined by the context. This type coercion automagically handles swapping between types.
  - All arrays in Awk are associative (think python dictionary or JavaScript Object) and can be indexed with either strings or numbers (all strings under the hood).
- Patterns play a very central role in the control flow of an Awk program; we can put multiple patterns together to match inputs we would like to take a certain action for.
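To make a few of those bullets concrete, here is a minimal sketch (not taken from the post) that sets a custom FS, grabs a field with $2, and conjures an associative array into existence. It assumes a hypothetical colon-separated file pets.txt:
# pets.txt (made up for illustration):
#   fluffy:cat
#   rex:dog
#   tom:cat
awk 'BEGIN { FS = ":" }                        # split each record on ":"
{ kinds[$2] += 1 }                             # $2 is the second field; "kinds" exists the moment we use it
END { for (k in kinds) print k, kinds[k] }' pets.txt
This would print something like "cat 2" and "dog 1" (the order of a for (k in kinds) walk is not guaranteed).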
If we were to invoke this program on a data file, say “countries” with the following contents:
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
We would get the following output. Most of it isn't exactly sensical, but the point of this example is the illustrative code above:
DATA/countries--Record NO. 1 has field 1 set to- USSR
& it has 5 fields
this country is from the old world
this is one large country...
DATA/countries--Record NO. 2 has field 1 set to- Canada
& it has 5 fields
this line has the substring North on it
DATA/countries--Record NO. 3 has field 1 set to- China
& it has 5 fields
this country is from the old world
this is one large country...
DATA/countries--Record NO. 4 has field 1 set to- USA
& it has 5 fields
this line has the substring North on it
DATA/countries--Record NO. 5 has field 1 set to- Brazil
& it has 5 fields
DATA/countries--Record NO. 6 has field 1 set to- India
& it has 5 fields
this country is from the old world
this is one large country...
DATA/countries--Record NO. 7 has field 1 set to- Mexico
& it has 5 fields
this line has the substring North on it
DATA/countries--Record NO. 8 has field 1 set to- France
& it has 5 fields
this country is from the old world
DATA/countries--Record NO. 9 has field 1 set to- Japan
& it has 5 fields
this country is from the old world
DATA/countries--Record NO. 10 has field 1 set to- Germany
& it has 5 fields
this country is from the old world
DATA/countries--Record NO. 11 has field 1 set to- England
& it has 5 fields
this country is from the old world
this data file has: 4 unique continents
South America 1
North America 3
Asia 4
Europe 3
total 16048
drwxr-xr-x@ 10 ckg staff 320 Aug 26 13:18 DATA
-rw-r--r--@ 1 ckg staff 1077 Jul 1 14:07 LICENSE
-rw-r--r--@ 1 ckg staff 6316 Aug 26 15:44 README.md
drwx--x--x@ 72 ckg staff 2304 Aug 26 15:40 SAMPLES
-rw-r--r--@ 1 ckg staff 8204128 Jul 1 14:07 The_AWK_Programming_Language.pdf
The above syntax and general flow constitutes the gist of Awk. The language is made powerful through using this set of primitives in conjunction with other programs at the command line. In the spirit of sensibility, Awk programs are run in one of the following two ways.
- Invoked directly on a program string:
awk '{ print $0 }' inputdata
- Invoked through referencing an Awk program file (commonly called “progfile”):
awk -f progfile inputdata
Where progfile contains the same code as in the single-quoted string of the first bullet point, and inputdata is a file which might contain the text:
what a lovely moon.
oh wait, you're supposed to say "hello world"
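As an aside not covered above, a third common pattern is to make the progfile itself executable by starting it with a shebang line (assuming awk lives at /usr/bin/awk on your system):
#!/usr/bin/awk -f
# an executable progfile: print every input record as-is
{ print $0 }
After a chmod +x progfile, it can be run directly as ./progfile inputdata.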
And that is the introduction to the language! Awk comes in a few different flavours (as do all things Open Source) but the above syntax should carry over to the varietals. One descriptive list of the instantiations of the language can be found on the wiki page. To conclude, let's mess around with a useful program.
Problem: How many commits have I made to main on a local git repo?
Solution a la Awk:
BEGIN {FS = "\t"}
match($1, "([^<]*)>") {
email = substr($1, RSTART, RLENGTH-1)
committers[email] += 1
}
END {
print length(committers) " -- putting up commits in '" FILENAME "'"
for (person in committers) {
print " " person " made " committers[person] " commits."
}
}
Which one might store in progfile and run inside a git repo, at the root dir, as follows:
awk -f progfile .git/logs/refs/heads/main
Which for a particularly gawk-able repo on my local system returns:
1 -- putting up commits in '.git/logs/refs/heads/main'
chris@lindy.ai made 38 commits.
What we do in this program is specify a field separator that splits the related datafile into two nice fields. Then we find the unique contributors for this local branch by parsing out the unique emails with a regex match pattern, keeping track of the number of commits associated with each email address. Finally, once we have finished iterating over the entire file, we dump the data we care about to stdout.
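For a sense of why the FS = "\t" line yields two nice fields: each record in .git/logs/refs/heads/main is a reflog line with, roughly, the following shape (the placeholders here are illustrative, not real repo data):
# <old-sha> <new-sha> Name <email> <timestamp> <timezone> <TAB> commit: <message>
# everything before the tab becomes $1 (which holds the "<email>"); the message after it becomes $2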
This program took me about 5-ish min to whip up, and I am a complete novice. Imagine what a full-fledged Wizard could procure?
Where to learn a bit more
It's rather crazy that this language made in the 70s is still relevant today, both as a tool to do good work and as a learning opportunity. It is not a general-purpose solution as is “python”, but the ideas behind awk are worth walking through and sitting with.
That walk with Awk might start with The Awk Programming Language or with this repository I threw together as I went through the book, containing runnable example programs along with the data necessary to experiment with them.
Learning Awk provides you an opportunity to get intimately familiar with regular expressions in a sensible way, and I believe it opens the door to the idea that your operating system is your standard library. An empowering perspective for the budding scriptkid.
As we part ways, I'd like to share a snippet I found moving from my read of “The Awk Programming Language”. First, think about how trivial it is to git clone some software. It is rather easy to get access to troves of useful magic on the internet with a protocol like this. To get your hands on Awk when the authors dropped the book, these were the suggested instructions:
“This version of awk is part of Unix System V Release 3.1. Source code for this version is also available through AT&T’s Unix System Toolchest software distribution system; call 1-201-522-6900 and log in as guest. In Europe, contact AT&T Unix Europe in London (44-1-567-7711); in the Far East, contact AT&T Unix Pacific in Tokyo (81-3-431-3670).”
That is mental, have a wonderful day.