Introduction to Awk

March 30, 2020

Awk is a text-processing language for reading and manipulating data. If you need to quickly process text patterns inside a file, especially a file organized in rows and columns, awk might be the tool for the job.

Let's see some examples.

This command kills the process listening on localhost:3000 (don't worry about understanding the code below; I will go over it later):

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9

Let's do something simpler:

awk '{print}' server.rb

This displays the file content, similar to cat server.rb. Awk also makes it easy to add a filter. If you want to display only the lines that contain the word "run", you can do:

awk '/run/ {print}' server.rb

Very powerful. I am not even scratching the surface of the awk iceberg.

Basic Syntax

Awk's basic syntax is:

awk 'pattern {action}' file

One important action is print. Let's do some examples with print; I will go over pattern later.

For this, let's create a file called awk.ward (pun intended):

echo 'Awk. Or do not awk. There is no try' > awk.ward

To get the content of the file, we can do:

awk '{print $0}' awk.ward

Let's try another print variation. This time we will hard-code the output:

awk '{print "Hello awk!"}' awk.ward

This prints "Hello awk!" once for every line in the file, regardless of what the line contains.

Fields

Earlier we saw:

awk '{print $0}' awk.ward

You may wonder what $0 is. In awk, $0 represents the whole current record, which is usually the entire line. A bare print does the same thing ({print $0} is the same as {print}).

In addition, awk splits each line into "fields". By default, fields are delimited by spaces and tabs. Let's check out the fields:

awk '{print $0}' awk.ward
awk '{print $1}' awk.ward
awk '{print $2}' awk.ward
awk '{print $9}' awk.ward

My awk.ward file contains 1 line and 9 fields (each separated by a space). If you ask awk to print a field beyond the last one it captured (like field 10), it prints an empty line:

awk '{print $10}' awk.ward
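If you don't want to count fields by hand, awk can do it for you. Here is a small sketch using the built-in variable NF (number of fields), which this article has not introduced yet. It recreates the same awk.ward file so you can run it on its own:

```shell
# Recreate the sample file from the article.
echo 'Awk. Or do not awk. There is no try' > awk.ward

# NF holds the number of fields in the current record.
awk '{print NF}' awk.ward              # prints 9

# $NF is therefore the last field, however long the line is.
awk '{print $NF}' awk.ward             # prints try

# A field past NF is an empty string, so this prints [].
awk '{print "[" $10 "]"}' awk.ward
```

The brackets in the last command are only there to make the empty field visible.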

You can change the delimiter with -F. In this case, we want to split the line on periods (.) instead of spaces, so we pass -F followed by a period:

awk -F. '{print $1}' awk.ward
awk -F. '{print $2}' awk.ward
awk -F. '{print $3}' awk.ward

You can also print multiple fields at once:

awk -F. '{print $2, $3, $1}' awk.ward 
## Or do not awk  There is no try Awk
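Notice the extra spaces in that output: with -F. the space after each period stays attached to the next field. One way around it, sketched below, relies on the fact that a separator longer than one character is treated as a regular expression. The brackets in the output are only there to make the spaces visible:

```shell
echo 'Awk. Or do not awk. There is no try' > awk.ward

# Splitting on the period alone keeps the leading space.
awk -F. '{print "[" $2 "]"}' awk.ward          # prints [ Or do not awk]

# Splitting on "a literal period followed by any number of spaces" drops it.
awk -F'[.] *' '{print "[" $2 "]"}' awk.ward    # prints [Or do not awk]
```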

Pattern matching

Recall our basic awk syntax: awk 'pattern {action}' file

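Let's talk about pattern now. A pattern written between slashes is a regular expression (awk uses extended regular expressions, so + works without escaping). For example, to match one or more letters: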

awk '/[A-Za-z]+/ {print "I have string"}' awk.ward

To match digits (this won't display anything because there are no digits in awk.ward):

awk '/[0-9]+/ {print "I have integer"}' awk.ward

If we create a new file, testFile.txt and inside we have:

1. This is first line
2. This is second line
3. This is third line
This is not part of the list

If we run awk '/[0-9]+/ {print}' testFile.txt, we get:

1. This is first line
2. This is second line
3. This is third line

Our command works as expected. It omits "This is not part of the list" because that line does not contain any digits (/[0-9]+/).
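If you want to try this yourself, here is a self-contained sketch that builds testFile.txt and runs the filter. The second command is my own addition: putting ! in front of a pattern inverts it, so it selects the lines that do not match.

```shell
# Build the four-line sample file.
printf '%s\n' '1. This is first line' '2. This is second line' \
  '3. This is third line' 'This is not part of the list' > testFile.txt

# Lines that contain at least one digit (the three list items).
awk '/[0-9]+/ {print}' testFile.txt

# Lines that contain no digit at all.
awk '!/[0-9]/ {print}' testFile.txt    # prints This is not part of the list
```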

Executing awk script from file

When our script grows too big, we can put the awk program in a script file and run it from there.

Awk accepts -f to execute awk scripts. Let's create a script file and we will call it awk.script (you can name this anything):

## awk.script
/[0-9]+/ { print "I have integer" } 
/[A-Za-z]+/ { print "I have string" } 

Then run it against our awk.ward file: awk -f awk.script awk.ward

You'll see "I have string". That is expected, because awk.ward does not contain any integers.

What do you think will print if we run it against our testFile.txt?

awk -f awk.script testFile.txt
I have integer
I have string
I have integer
I have string
I have integer
I have string
I have string

It returns what we expect. The first 3 lines contain both letters and digits, so awk prints two lines for each of them. The last line does not contain a digit, so awk prints only the string message.
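To reproduce this without opening an editor, you can write the script file from the shell. This is only a sketch: the heredoc is one convenient way to create awk.script, and the one-line input piped in at the end is made up for the demonstration.

```shell
# Write the awk program to a file. The quoted 'EOF' stops the shell
# from expanding anything inside the heredoc.
cat > awk.script <<'EOF'
# awk.script: every rule is tested against every line, in order.
/[0-9]+/    { print "I have integer" }
/[A-Za-z]+/ { print "I have string" }
EOF

echo 'Awk. Or do not awk. There is no try' > awk.ward
awk -f awk.script awk.ward                          # prints I have string

# A line with both digits and letters triggers both rules.
echo '1. This is first line' | awk -f awk.script
```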

Chaining awk

In real life, I don't really use awk by itself that often. More often, I combine it with other commands.

Let's take the command from the introduction and break it down. If you are coding along, note that I have a server running on localhost:3000. Fire up a local server of your own to see that the command actually kills it.

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9

Let's walk through each step. lsof -i:3000 gives:

COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)

lsof -i:3000 | awk '/LISTEN/ {print}' displays only the row with "LISTEN":

node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)

Now we need to target the 2nd field, because that's where our PID is. Change the action from {print} to {print $2} (lsof -i:3000 | awk '/LISTEN/ {print $2}'). This returns our PID:

48523

When we add xargs kill -9, xargs reads the PID from standard input and passes it to kill -9 as an argument, which terminates that process. We need xargs here because kill expects the PID as a command-line argument and does not read it from a pipe. For more explanation, this SO post explains it well.
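If you would rather not kill anything while experimenting, here is a dry-run sketch. It feeds awk the sample lsof output from this article as saved text (the PID is just the one from my machine), and puts echo in front of kill so xargs only prints the command it would have run:

```shell
# Sample lsof output copied from the article; nothing here talks to a real process.
lsof_output='COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)'

# Same awk filter as above, fed from the saved text instead of lsof.
printf '%s\n' "$lsof_output" | awk '/LISTEN/ {print $2}'                        # prints 48523

# Dry run: echo shows the command xargs builds instead of executing it.
printf '%s\n' "$lsof_output" | awk '/LISTEN/ {print $2}' | xargs echo kill -9   # prints kill -9 48523
```

Once the printed command looks right, removing echo and swapping the saved text for the real lsof call gives you the original pipeline.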

Begin, middle, end

An awk script consists of 3 parts: beginning, middle, and end. The beginning runs once, before any input is processed. The middle is our main loop: everything we've done up to this point happened in the main loop, and most work in awk is done there. The end runs once, after the main loop has finished.

  • BEGIN { # beginning script }
  • { # main input loop script }
  • END { # end script }

Suppose we have a file hello.txt with content:

Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10

And we run this:

awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt

We should expect 12 lines: 1 from BEGIN, 10 from main loop, and 1 from END. Our actual stdout:

BEGIN
Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10
END

Exactly what is expected.
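Here is a runnable sketch of the same experiment. The for loop is just one way to generate hello.txt, and the last command uses the built-in variable NR (number of records read so far), which I have not covered above; inside END it holds the total.

```shell
# Generate hello.txt with the lines Hello1 through Hello10.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "Hello$i"; done > hello.txt

# BEGIN runs once, the main block runs once per line, END runs once.
awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt | wc -l   # 12 lines

# In END, NR still holds the number of records that were read.
awk 'END {print NR " records read"}' hello.txt                            # prints 10 records read
```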

Field Separator

Recall that we can redefine the delimiter (field separator) with -F. We can also redefine it inside the script with the built-in variable FS (Field Separator). The convention is to set it inside BEGIN, before any input is read and processed.

For example, greetings.txt contains this text:

Hello, how are you, sire?

When we inspect the fields, they are separated by space.

awk '{print $1}' greetings.txt # Hello,
awk '{print $2}' greetings.txt # how
awk '{print $3}' greetings.txt # are
## ... and so on

We want to separate them by comma. Here is how you can redefine separator:

awk 'BEGIN {FS = "," } {print $2}' greetings.txt # how are you
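As a quick check, the sketch below shows that setting FS in BEGIN and passing -F on the command line give the same result for this file. The brackets are only there to show that the space after the comma stays in the field:

```shell
echo 'Hello, how are you, sire?' > greetings.txt

# Field separator set inside the script.
awk 'BEGIN {FS = ","} {print "[" $2 "]"}' greetings.txt    # prints [ how are you]

# Field separator set on the command line: same output.
awk -F, '{print "[" $2 "]"}' greetings.txt                 # prints [ how are you]
```

I tend to reach for -F in one-liners and FS in script files, but that is a matter of taste.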

Record Separator

By now, you can tell that awk performs operations line-wise. In awk, each line is a record. Each record contains multiple "fields", separated by tabs/ spaces (that we can change with -F or FS). What if we need to read chunks of multiple lines?

What if our data looks like users.txt below?

Iggy
Programmer 
123-123-1234

Yoda
Jedi Master
111-222-3333

We need to make the lines from "Iggy" to "123-123-1234" one record and the lines from "Yoda" to "111-222-3333" another record. How do we tell awk to chunk our data this way?

Luckily, awk has a "Record Separator" (RS) to do this. As you can guess, the default record separator is a newline (\n). Let's change that:

awk 'BEGIN {FS="\n"; RS=""} {print "Name:", $1; print "Rank:", $2; print ""}' users.txt

This returns:

Name: Iggy
Rank: Programmer

Name: Yoda
Rank: Jedi Master

Which is exactly what we expected. Now every $1 contains a name, every $2 a rank/title, and every $3 a phone number.
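To see the third field as well, here is a small sketch that rebuilds users.txt and prints each record on a single line. The output format is just something I made up for the demonstration:

```shell
# Two three-line records separated by one blank line.
printf '%s\n' 'Iggy' 'Programmer' '123-123-1234' '' \
  'Yoda' 'Jedi Master' '111-222-3333' > users.txt

# With RS="" and FS="\n": $1 is the name, $2 the rank, $3 the phone number.
awk 'BEGIN {FS = "\n"; RS = ""} {print $1 " (" $2 "): " $3}' users.txt
# Iggy (Programmer): 123-123-1234
# Yoda (Jedi Master): 111-222-3333
```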

How did it work?

  • We changed the field separator (FS) from its default of spaces/tabs to a newline (\n). Now a newline marks a new field instead of a new record.
  • We changed the record separator (RS) from its default of a newline to the empty string ("").

You may ask, how does making record separator "" make chunking above work? That doesn't make sense. Shouldn't we use RS = "\n\n+" for when we have two or more newlines?

When awk sees RS set to the empty string (""), it treats records as separated by one or more blank lines. Records separated by blank lines are apparently common enough that awk has this special case for RS="".

In other words, each record is now separated by a blank line. The next record starts after the blank line.

This is
a record

This is another
record
separated by blank line

This is yet another record
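You can confirm how awk chunks that sample with the sketch below. The file name records.txt is made up, and NR and NF are the built-in counters for records and fields:

```shell
# The sample above: three records separated by blank lines.
printf '%s\n' 'This is' 'a record' '' 'This is another' 'record' \
  'separated by blank line' '' 'This is yet another record' > records.txt

# Count the records awk sees in paragraph mode.
awk 'BEGIN {RS = ""} END {print NR}' records.txt     # prints 3

# With FS="\n", NF is the number of lines in each record.
awk 'BEGIN {RS = ""; FS = "\n"} {print NR ": " NF " line(s)"}' records.txt
# 1: 2 line(s)
# 2: 3 line(s)
# 3: 1 line(s)
```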

For more information about this weird behavior, check out this link.

Conclusion

I think this is a good place to end. There are still many more features I didn't get to cover here: variables, conditionals, functions, etc. I will leave those for you.

Can you do what awk does with a scripting language like Python or Ruby? Definitely. But if you need something on the fly, awk might be a better choice. Plus, it is included in most Unix-like operating systems, so you don't need to install anything.

Do you need to know awk to be a good developer? Definitely not. I know many great developers who don't know awk. But knowing a little awk can be very helpful - it looks really cool.

Thanks for reading. Happy coding!

Resources