Awk is a text-processing language for reading and manipulating data. If you need to quickly process text patterns inside a file, especially if your file contains rows and columns, awk might be the tool for the job.
Let's see some examples.
This command kills the process listening on localhost:3000 (don't worry about understanding the code below yet; I will go over it later):
lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9
Let's do something simpler:
awk '{print}' server.rb
This displays the file content, similar to cat server.rb. Awk also makes it easy to add a filter. If you want to display only lines that contain the word "run", you can do:
awk '/run/ {print}' server.rb
Very powerful. I am not even scratching the surface of the awk iceberg.
Basic Syntax
Awk's basic syntax is:
awk 'pattern {action}' file
One important action is print. Let's do some examples with print; I will go over pattern later.
For this, let's create a file called awk.ward (pun intended):
echo 'Awk. Or do not awk. There is no try' > awk.ward
To get the content of the file, we can do:
awk '{print $0}' awk.ward
Let's try another print variation. This time we will hard-code the output:
awk '{print "Hello awk!"}' awk.ward
This prints "Hello awk!" regardless of what the file content is.
Fields
Earlier we saw:
awk '{print $0}' awk.ward
You may wonder what $0 is. In awk, $0 represents the whole record. Usually that is the entire line. A bare print statement does the same thing ({print $0} is the same as {print}).
In addition, awk also splits each line into "fields". By default, fields are delimited by spaces and tabs. Let's check out the fields:
awk '{print $0}' awk.ward
awk '{print $1}' awk.ward
awk '{print $2}' awk.ward
awk '{print $9}' awk.ward
My awk.ward file contains 1 line and 9 fields (each separated by a space). If you ask awk to print a field beyond what it captured (like field 10), it prints an empty line:
awk '{print $10}' awk.ward
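To make the field numbering concrete, here is a small sketch that recreates awk.ward and prints a few fields. The comments show the output I would expect, assuming the file holds exactly the one line from earlier:

```shell
# Recreate the sample file from earlier (one line, nine space-separated fields).
echo 'Awk. Or do not awk. There is no try' > awk.ward

awk '{print $1}' awk.ward    # Awk.
awk '{print $2}' awk.ward    # Or
awk '{print $9}' awk.ward    # try
awk '{print $10}' awk.ward   # (an empty line: there is no 10th field)
```

Note that the periods stay attached to "Awk." and "awk." because only spaces and tabs separate fields here.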
You can change the delimiter with -F. In this case, we want to split the fields on . instead of space. To tell awk to separate on ., we use -F. like this:
awk -F. '{print $1}' awk.ward
awk -F. '{print $2}' awk.ward
awk -F. '{print $3}' awk.ward
You can also print multiple fields at once:
awk -F. '{print $2, $3, $1}' awk.ward
## Or do not awk There is no try Awk
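One detail that is easy to miss: splitting on . keeps the space that followed each period, so $2 and $3 start with a space. Here is a quick sketch that makes those spaces visible by wrapping each field in square brackets (the brackets are only there for illustration):

```shell
# Recreate the sample file from earlier.
echo 'Awk. Or do not awk. There is no try' > awk.ward

# Wrap each field in [] so leading spaces are visible.
awk -F. '{print "[" $1 "]"}' awk.ward   # [Awk]
awk -F. '{print "[" $2 "]"}' awk.ward   # [ Or do not awk]
awk -F. '{print "[" $3 "]"}' awk.ward   # [ There is no try]
```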
Pattern matching
Recall our basic awk syntax:
awk 'pattern {action}' file
Let's talk about pattern now. It accepts extended regular expressions, which is why + works below without escaping. For example, to match a run of one or more letters (A-Z or a-z):
awk '/[A-Za-z]+/ {print "I have string"}' awk.ward
To match an integer (it won't display anything because there is no integer inside awk.ward):
awk '/[0-9]+/ {print "I have integer"}' awk.ward
If we create a new file, testFile.txt, and inside we have:
1. This is first line
2. This is second line
3. This is third line
This is not part of the list
If we run awk '/[0-9]+/ {print}' testFile.txt, we get:
1. This is first line
2. This is second line
3. This is third line
Our command works as expected. It omits "This is not part of the list" because that last line does not contain any integer (/[0-9]+/).
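Keep in mind that /[0-9]+/ matches a digit anywhere on the line, not only at the start. If the intent is "lines that begin with a list number", anchoring the pattern is one option. In this sketch, list.txt and its last line are made up for illustration:

```shell
# list.txt is a hypothetical file; its last line has a digit but is not a list item.
printf '%s\n' '1. This is first line' '2. This is second line' 'Not item number 3' > list.txt

# Unanchored: prints all three lines, because each contains a digit.
awk '/[0-9]+/ {print}' list.txt

# Anchored: prints only the lines that begin with digits followed by a period.
awk '/^[0-9]+\./ {print}' list.txt
```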
Executing awk script from file
When our script grows too big, we can put it in a file and have awk run it from there.
Awk accepts -f to execute a script file. Let's create a script file and call it awk.script (you can name it anything):
## awk.script
/[0-9]+/ { print "I have integer" }
/[A-Za-z]+/ { print "I have string" }
Then run it against our awk.ward file:
awk -f awk.script awk.ward
You'll see "I have string". That is expected, because awk.ward does not contain an integer.
What do you think will print if we run it against our testFile.txt?
awk -f awk.script testFile.txt
I have integer
I have string
I have integer
I have string
I have integer
I have string
I have string
It returns what we expect. The first 3 lines contain both a string and an integer, so awk prints two lines for each of them. The last line does not contain an integer, so awk prints only the string match output.
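If you want to reproduce this from scratch, here is a self-contained sketch that writes both files and runs the script. Using a heredoc and printf to create the files is just my choice; any editor works:

```shell
# Write the two-rule script from above to awk.script.
cat > awk.script <<'EOF'
/[0-9]+/    { print "I have integer" }
/[A-Za-z]+/ { print "I have string" }
EOF

# Recreate the four-line test file.
printf '%s\n' \
  '1. This is first line' \
  '2. This is second line' \
  '3. This is third line' \
  'This is not part of the list' > testFile.txt

# Every rule is tried against every line, in the order written.
awk -f awk.script testFile.txt
```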
Chaining awk
In real life, I don't really use awk by itself that often. More often, I combine it with other commands.
Let's take the command from earlier and break it down. By the way, if you are coding along, I have a server running on localhost:3000. Fire up a local server to see that the command actually kills it.
lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9
Let's walk through each step. lsof -i:3000 gives:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN)
lsof -i:3000 | awk '/LISTEN/ {print}' displays only the row with "LISTEN":
node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN)
Now we need to target the 2nd "field", because that's where our PID is. Modify the action so it prints only $2 on lines matching the "LISTEN" pattern (lsof -i:3000 | awk '/LISTEN/ {print $2}'). This returns our PID:
48523
When we add xargs kill -9, it passes the PID to kill -9, which terminates that process. We need xargs here because kill takes the PID as a command-line argument and does not read it from standard input; xargs turns the piped text into that argument. For more explanation, see "Why is xargs necessary?" in the Resources below.
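If you'd rather not kill anything while experimenting, here is a dry-run sketch. It feeds awk a canned copy of the lsof output above instead of calling lsof, and it puts echo in front of kill so the final command is printed, not executed. The variable name lsof_output is just something I made up:

```shell
# Canned lsof output (copied from above) so nothing real is inspected or killed.
lsof_output='COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN)'

# echo makes xargs print the command it would otherwise run.
printf '%s\n' "$lsof_output" | awk '/LISTEN/ {print $2}' | xargs echo kill -9
# prints: kill -9 48523
```

Once the printed command looks right, dropping echo gives you the real thing.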
Begin, middle, end
An awk script consists of 3 parts: beginning, middle, and end. The beginning runs once before any input is processed. The middle is our main loop, which runs once per record; everything we've done up to this point happens in the main loop, and so do most things in awk. The end runs once after the main loop has finished.
BEGIN { }   # beginning script goes inside the braces
{ }         # main input loop script goes inside the braces
END { }     # end script goes inside the braces
Suppose we have a file hello.txt with content:
Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10
And we run this:
awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt
We should expect 12 lines: 1 from BEGIN, 10 from main loop, and 1 from END. Our actual stdout:
BEGIN
Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10
END
Exactly what is expected.
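Here is a quick way to check the line count yourself. This sketch builds hello.txt with a shell loop (my shortcut, not a requirement) and pipes the awk output through wc -l:

```shell
# Build hello.txt with the ten lines Hello1 .. Hello10.
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "Hello$i"
done > hello.txt

# One line from BEGIN, ten from the main loop, one from END.
awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt | wc -l
# prints 12 (some systems pad the number with leading spaces)
```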
Field Separator
Recall that we can redefine the delimiter (field separator) with -F. In awk, we can also redefine the field separator inside our script with the built-in variable FS (Field Separator). The convention is to set it inside BEGIN, right before the file is read and processed.
For example, inside greetings.txt we have this text:
Hello, how are you, sire?
When we inspect the fields, they are separated by space.
awk '{print $1}' greetings.txt # Hello,
awk '{print $2}' greetings.txt # how
awk '{print $3}' greetings.txt # are
## ... and so on
We want to separate them by comma. Here is how you can redefine separator:
awk 'BEGIN {FS = "," } {print $2}' greetings.txt # how are you
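To convince yourself that -F and FS do the same job, here is a small sketch that runs both forms against greetings.txt:

```shell
# Recreate the sample file.
echo 'Hello, how are you, sire?' > greetings.txt

# Both commands split on commas and print the second field.
awk -F, '{print $2}' greetings.txt
awk 'BEGIN {FS = ","} {print $2}' greetings.txt
# each prints " how are you" (the space after the first comma is kept)
```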
Record Separator
By now, you can tell that awk works line by line. In awk, each line is a record. Each record contains multiple "fields", separated by tabs or spaces (which we can change with -F or FS). What if we need to read chunks of multiple lines?
What if our data looks like users.txt below?
Iggy
Programmer
123-123-1234

Yoda
Jedi Master
111-222-3333
We need to make the lines from "Iggy" to "123-123-1234" one record and the lines from "Yoda" to "111-222-3333" another record. How do we tell awk to chunk our data this way?
Luckily, awk has a "Record Separator" (RS) for this. As you might guess, the default record separator is a newline (\n). Let's change that:
awk 'BEGIN {FS="\n"; RS=""} {print "Name:", $1; print "Rank:", $2; print "\n"}' users.txt
This returns:
Name: Iggy
Rank: Programmer
Name: Yoda
Rank: Jedi Master
Which is exactly what we expected. Now every $1 contains a name, $2 a rank/title, and $3 a phone number.
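Since the command above never prints $3, here is a sketch that does. It recreates users.txt with printf, assuming the two users are separated by a single empty line:

```shell
# Recreate users.txt; the empty line between the two users is what RS="" keys on.
printf '%s\n' 'Iggy' 'Programmer' '123-123-1234' '' 'Yoda' 'Jedi Master' '111-222-3333' > users.txt

# With FS="\n", each line of a record becomes one field.
awk 'BEGIN {FS="\n"; RS=""} {print $1 ": " $3}' users.txt
# Iggy: 123-123-1234
# Yoda: 111-222-3333
```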
How did it work?
- We changed our Field Separator (FS) from the space/tab default to newline (\n). Now a newline marks a new field instead of a new record.
- We changed our record separator from the newline default to "".
You may ask, how does setting the record separator to "" make the chunking above work? That doesn't seem to make sense. Shouldn't we use RS = "\n\n+" to match two or more newlines?
When awk sees RS set to an empty string (""), it treats the records as separated by one or more blank lines. Apparently records separated by blank lines are common enough that awk accepts RS="" as a special case for them.
In other words, each record is now separated by a blank line. The next record starts after the blank line.
This is
a record

This is another
record
separated by blank line

This is yet another record
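The "one or more" part is worth trying out. In this sketch, records.txt is a made-up file with one blank line after the first record and two blank lines after the second; awk should still see three records:

```shell
# records.txt is a hypothetical file: one blank line after the first record,
# two blank lines after the second.
printf '%s\n' 'This is' 'a record' '' 'This is another' 'record' '' '' 'This is yet another record' > records.txt

# Print only the first line of each record.
awk 'BEGIN {FS="\n"; RS=""} {print $1}' records.txt
# This is
# This is another
# This is yet another record
```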
For more information about this weird behavior, see "Multiple-Line Records" in the Resources below.
Conclusion
I think this is a good place to end. There are still many more features I didn't get to cover here: variables, conditionals, functions, etc. I will leave those for you.
Can you do what awk does with a scripting language like Python or Ruby? Definitely. But if you need something on the fly, awk might be a better choice. Plus, it is included in most Unix-like operating systems, so you don't need to install anything.
Do you need to know awk to be a good developer? Definitely not. I know many great developers who don't know awk. But knowing a little awk can be very helpful - it looks really cool.
Thanks for reading. Happy coding!
Resources
- Is there still any reason to learn AWK?
- Linux and Unix xargs command tutorial with examples
- Multiple-Line Records
- Why is xargs necessary?
- The A-Z of programming languages: AWK
- Robbins, Arnold and Dale Dougherty. Sed & awk: UNIX Power Tools.
- man awk