Organize data on-the-go with Linux sort

June 6, 2020

Imagine working in a galaxy far, far away for Lord Vader. One day he gives you a long list of plain text files and you need to sort them quickly or he will force choke you. All you have is a terminal. What would you do?

Luckily, the force is with you and your terminal probably has a sort command to sort data. Here is what you will learn:

Basic sort

Suppose you have a file with droid names (droid.txt):

AZI-3
2-1B
Buzz Droid
IG-88
Battle Droid
Droideka
C-3PO
R2-D2

To do a basic sort, run:

sort droid.txt

In the terminal, it returns:

2-1B
AZI-3
Battle Droid
Buzz Droid
C-3PO
Droideka
IG-88
R2-D2

Notice that sort doesn't actually change droid.txt; it only prints the sorted lines. If you want to overwrite the original file, use -o (for output):

sort droid.txt -o droid.txt

Personally, I don't like mutating any file. I like to keep the result separate from the original file, so I would save it in a different file:

sort droid.txt -o droid-sorted.txt
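As a quick check of the behavior above, here is a minimal sketch that sorts a short droid list inside a throwaway directory. The workdir variable and the LC_ALL=C setting (which pins the ordering to plain byte order so the result is predictable) are choices made for this example, not part of the original commands.

```shell
# Work in a throwaway directory so no real file is overwritten.
workdir=$(mktemp -d)
printf '%s\n' 'AZI-3' '2-1B' 'Buzz Droid' 'IG-88' > "$workdir/droid.txt"

# -o names the output file; the input file is only read.
LC_ALL=C sort -o "$workdir/droid-sorted.txt" "$workdir/droid.txt"

first_original=$(head -n 1 "$workdir/droid.txt")
first_sorted=$(head -n 1 "$workdir/droid-sorted.txt")
echo "droid.txt still starts with:  $first_original"
echo "droid-sorted.txt starts with: $first_sorted"
```

The original file keeps AZI-3 on its first line, while the sorted copy starts with 2-1B.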

Basic sort rules

Let's discuss how sort sorts things. There are probably more rules than I have below, but these should be enough to get started.

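  1. Lowercase letters come before the same letters in uppercase ("a" before "A"). This depends on your locale: in the C locale (LC_ALL=C), every uppercase letter comes before every lowercase one.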
  2. Letters are sorted in alphabetical order, example: "a" comes before "b", "c" comes before "x".
  3. Numbers come before letters.
  4. When the first characters on both lines are the same, sort compares them at the first position where they differ. Meaning if you have:
racekar
racecar

The first 4 characters are the same: "r-a-c-e". The 5th characters are not: "k" vs "c". Sort uses this difference to decide which line goes first, so racecar comes before racekar.
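The rules above can be checked from the terminal. The sketch below is only an illustration: it pins the locale to C with LC_ALL=C (an assumption made here so the result is reproducible), where ordering is by plain byte value, and then runs the same input under your current locale, whose result may differ from machine to machine.

```shell
# Five lines sorted with the locale pinned to C (plain byte order):
# digits first, then uppercase letters, then lowercase letters.
c_order=$(printf '%s\n' b A a B 1 | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale:       $c_order"

# The same lines under whatever locale the shell currently uses.
# The result depends on your system settings, so none is claimed here.
current_order=$(printf '%s\n' b A a B 1 | sort | paste -s -d ' ' -)
echo "current locale: $current_order"

# Rule 4: the first differing character decides ("c" sorts before "k").
winner=$(printf '%s\n' racekar racecar | LC_ALL=C sort | head -n 1)
echo "first line:     $winner"
```

If the two orderings differ on your machine, the locale is the reason, and prefixing a command with LC_ALL=C is one way to get the same order everywhere.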

Reverse sort

To reverse the order, use the -r (reverse) option.

sort -r droid.txt

You'll get:

R2-D2
IG-88
Droideka
C-3PO
Buzz Droid
Battle Droid
AZI-3
2-1B

Numerical sort

Suppose you have number.txt containing:

234
39
1000
59
7

When you run sort number.txt, you get:

1000
234
39
59
7

This is probably not what you expect. By default, sort compares lines as text, one character at a time. When it looks at the first character on each line, it sees "2, 3, 1, 5, 7". It sorts them into "1, 2, 3, 5, 7", which is technically the correct order for text. You need to tell sort to compare the lines numerically with the -n (numeric sort) option:

sort -n number.txt

This is more like it:

7
39
59
234
1000

Alternatively, you can use its long-form option:

sort --sort=numeric number.txt
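To see the difference side by side, here is a small sketch that feeds the same numbers to sort three ways. The variable names are made up for this example, and LC_ALL=C is added to the text sort only to make its byte-order result predictable.

```shell
numbers=$(printf '%s\n' 234 39 1000 59 7)

# Compared as text: "1000" comes first because "1" sorts before "2".
as_text=$(printf '%s\n' "$numbers" | LC_ALL=C sort | paste -s -d ' ' -)

# Compared as numbers, smallest first.
as_numbers=$(printf '%s\n' "$numbers" | sort -n | paste -s -d ' ' -)

# -n and -r together: compared as numbers, largest first.
largest_first=$(printf '%s\n' "$numbers" | sort -n -r | paste -s -d ' ' -)

echo "text:          $as_text"
echo "numeric:       $as_numbers"
echo "numeric, desc: $largest_first"
```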

Month sort

Sort is also capable of sorting months. For example, if you have months.txt:

February
December
June
August
January

A normal sort would just sort them based on their first characters. You need to tell sort that these lines are month names with -M (month) or --sort=month:

sort -M months.txt

Result:

January
February
June
August
December
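Here is a small sketch comparing a plain sort with a month sort on the same input. Month names depend on the locale, so this example assumes English names and sets LC_ALL=C to make that explicit; the variable names are illustrative.

```shell
months=$(printf '%s\n' February December June August January)

# Plain text order: alphabetical, which is not calendar order.
by_text=$(printf '%s\n' "$months" | LC_ALL=C sort | paste -s -d ' ' -)

# Month order: January through December.
by_month=$(printf '%s\n' "$months" | LC_ALL=C sort -M | paste -s -d ' ' -)

echo "as text:   $by_text"
echo "as months: $by_month"
```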

Random sort

Sort can also create randomness instead of order. You can use -R or --sort=random. Suppose you have abc.txt containing an already sorted list:

At-At Walker
Boba Fett
C3-P0
Darth Vader
Ewok

If you run:

sort -R abc.txt

You get a randomized result:

Boba Fett
At-At Walker
Ewok
Darth Vader
C3-P0

Running it multiple times will generally give a different order each time.
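Because the output changes from run to run, the sketch below does not show an expected order. Instead it checks something that always holds: the shuffled lines are the same lines, only rearranged. One caveat worth knowing: GNU sort documents -R as a shuffle that keeps identical lines next to each other, so for lists with duplicates a dedicated tool such as shuf may be closer to what you want.

```shell
names=$(printf '%s\n' 'At-At Walker' 'Boba Fett' 'C3-P0' 'Darth Vader' 'Ewok')

# Shuffle the lines; the order differs between runs.
shuffled=$(printf '%s\n' "$names" | sort -R)
printf '%s\n' "$shuffled"

# Sorting the shuffled lines again recovers the original list,
# which shows that -R only reorders lines and never drops or changes them.
restored=$(printf '%s\n' "$shuffled" | LC_ALL=C sort | paste -s -d ',' -)
echo "$restored"
```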

Unique sort

Sometimes you get duplicate items. Sort has an option to remove duplicates: -u (unique).

Suppose droid.txt now contains this list with duplicates:

2-1B
AZI-3
Jar Jar Binks
Battle Droid
Buzz Droid
C-3PO
Jar Jar Binks
Droideka
IG-88
Jar Jar Binks
R2-D2

The list has 3 Jar Jar Binks. Let's remove the duplicates:

sort -u droid.txt

You'll get:

2-1B
AZI-3
Battle Droid
Buzz Droid
C-3PO
Droideka
IG-88
Jar Jar Binks
R2-D2

Much better. There is only 1 Jar Jar Binks now.
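If you also want to know how often each line appeared, one common pattern is to pipe a plain sort into uniq -c. That command is not covered in this article, so treat the sketch below as an optional extra; the variable names are made up and LC_ALL=C is only there to keep the order predictable.

```shell
droids=$(printf '%s\n' 'Jar Jar Binks' 'R2-D2' 'Jar Jar Binks' 'C-3PO' 'Jar Jar Binks')

# -u sorts and keeps one copy of each line.
unique=$(printf '%s\n' "$droids" | LC_ALL=C sort -u | paste -s -d ',' -)
echo "unique: $unique"

# sort | uniq -c counts each distinct line; sort -rn puts the biggest count first.
# The awk step only squeezes the leading padding that uniq -c adds.
most_common=$(printf '%s\n' "$droids" | LC_ALL=C sort | uniq -c | sort -rn | head -n 1 | awk '{$1=$1; print}')
echo "most common: $most_common"
```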

Key sort

One powerful sort feature is that it can sort based on a column, or "key". Suppose you have a list of legendary basketball players and their jersey numbers inside basketball.txt:

Shaquille O'Neal 34
Kobe Bryant 8
Magic Johnson 32
Kareem Abdul-Jabbar 33
Michael Jordan 23
Stephen Curry 30

Running regular sort:

sort basketball.txt

You get a list sorted by first name:

Kareem Abdul-Jabbar 33
Kobe Bryant 8
Magic Johnson 32
Michael Jordan 23
Shaquille O'Neal 34
Stephen Curry 30

What if you want to sort them based on last name? To tell sort to use the second column (or "field") as the sorting basis, use the -k option. By default, columns are separated by whitespace. Since the last names are in the second column, you can use -k 2.

sort -k 2 basketball.txt

You get:

Kareem Abdul-Jabbar 33
Kobe Bryant 8
Stephen Curry 30
Magic Johnson 32
Michael Jordan 23
Shaquille O'Neal 34

Holy flexibility! That's awesome. These options can be stacked. What if you need to sort them based on their jersey number? Don't forget, you can't just do sort -k 3 basketball.txt, because sort would compare the numbers as text. Instead, you tell it to "sort based on the 3rd column's numeric values":

sort -k 3n basketball.txt

You get:

Kobe Bryant 8
Michael Jordan 23
Stephen Curry 30
Magic Johnson 32
Kareem Abdul-Jabbar 33
Shaquille O'Neal 34

There you go. Much better. Rest in peace, Kobe. You are the legend.

By the way, if you want to sort them based on jersey number, but in reverse, you can do:

sort -k 3nr basketball.txt
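The sketch below repeats these key sorts on an inline copy of part of the list, so it can be pasted straight into a terminal. It writes the key as -k 3,3 rather than -k 3, which limits the key to the third column only; for this data both forms give the same result. The variable names are illustrative.

```shell
players=$(printf '%s\n' "Shaquille O'Neal 34" 'Kobe Bryant 8' 'Magic Johnson 32' 'Michael Jordan 23')

# Third column, compared as numbers: smallest jersey number first.
lowest=$(printf '%s\n' "$players" | sort -k 3,3n | head -n 1)

# Same key with r added: largest jersey number first.
highest=$(printf '%s\n' "$players" | sort -k 3,3nr | head -n 1)

# Second column as text: last names in alphabetical order.
by_last_name=$(printf '%s\n' "$players" | LC_ALL=C sort -k 2,2 | awk '{print $2}' | paste -s -d ' ' -)

echo "lowest number:  $lowest"
echo "highest number: $highest"
echo "last names:     $by_last_name"
```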

Extra: more on keys

If you read the man page (man sort), you'll notice that on -k, it says:

-k field1[,field2]

What are field1 and field2 in the manual page?

In -k field1,field2, field1 is the starting position of the key and field2 is the ending position. If you leave out field2, the key runs to the end of the line. When I first read it, it didn't ring a bell until I saw it in action. Let's do some examples. Suppose you have this list in misc.txt:

Hullo a4 almost s1me
Hello c2 almost s3me
Hillo b3 almost s2me
Hallo d1 almost s4me

Let's run a few different sort commands and observe what they return:

sort misc.txt

Returns:

Hallo d1 almost s4me
Hello c2 almost s3me
Hillo b3 almost s2me
Hullo a4 almost s1me

It sees the "a, e, i, u" in "Hallo, Hello, Hillo, Hullo" and sorts on them.

sort -k 2 misc.txt

Returns:

Hullo a4 almost s1me
Hillo b3 almost s2me
Hello c2 almost s3me
Hallo d1 almost s4me

It sorts based on the second column: "a, b, c, d" in "a4, b3, c2, d1".

sort -k 3 misc.txt

Returns:

Hullo a4 almost s1me
Hillo b3 almost s2me
Hello c2 almost s3me
Hallo d1 almost s4me

It sorts with the 3rd column as the start of the key. Since the 3rd column is the same on every line ("almost"), it goes on to the next column and finds that "s1me, s3me, s2me, s4me" are not in order. It sorts the lines based on that.

sort -k 2,3 misc.txt

Returns:

Hullo a4 almost s1me
Hillo b3 almost s2me
Hello c2 almost s3me
Hallo d1 almost s4me

It sorts based on columns 2 and 3 only. The first differences are "a, b, c, d" in column 2, so it sorts based on those.
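In misc.txt the open-ended and the bounded keys happen to give the same order, so the effect of the ending position is hard to see. The sketch below uses a tiny made-up list (not from the article) where -k 2 and -k 2,2 disagree, with LC_ALL=C added to keep the comparison in plain byte order.

```shell
rows=$(printf '%s\n' 'x b 2' 'y a 9' 'z a 1')

# -k 2: the key runs from column 2 to the end of the line,
# so "a 1" is compared with "a 9" and the "z" line wins.
open_ended=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2 | paste -s -d ',' -)

# -k 2,2: the key is column 2 only. The two "a" lines tie on the key,
# and sort breaks the tie by comparing the whole lines ("y" before "z").
second_only=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2,2 | paste -s -d ',' -)

echo "-k 2:   $open_ended"
echo "-k 2,2: $second_only"
```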

Combining sort with other commands

Just like any good Unix tool, sort can be chained with other terminal commands. For example, what if you want to sort your basketball.txt based on jersey numbers, but display only the last names and jersey numbers, neatly aligned?

You can use awk to grab the last names and jersey numbers, and column to do the alignment:

sort -k 3n basketball.txt | awk '{print $2 "\t" $3}' | column -t

Returns:

Bryant        8
Jordan        23
Curry         30
Johnson       32
Abdul-Jabbar  33
O'Neal        34
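Here is the same pipeline as a self-contained sketch with the data inlined. Two assumptions are worth stating: every line has exactly the shape "First Last Number" (a name with extra words would shift the columns awk picks), and the column command may not be installed on every system, so the sketch falls back to the unaligned rows when it is missing.

```shell
players=$(printf '%s\n' "Shaquille O'Neal 34" 'Kobe Bryant 8' 'Michael Jordan 23')

# Sort by jersey number, then keep only last name and number (tab-separated).
rows=$(printf '%s\n' "$players" | sort -k 3n | awk '{print $2 "\t" $3}')
first_row=$(printf '%s\n' "$rows" | head -n 1)

# Align the columns when column is available; otherwise print the raw rows.
if command -v column >/dev/null 2>&1; then
    printf '%s\n' "$rows" | column -t
else
    printf '%s\n' "$rows"
fi
```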

Saves you hours of manual editing. Man, I love terminals.

Conclusion

Sort is useful for sorting structured data from the terminal. In this article, you learned how to do a basic sort, save your sorted data into a new file, reverse-sort, and sort based on a given column. You also learned how to sort numbers and months. Sort can also be used to create a random order. Finally, these options can be stacked together to create more complex operations.

Thanks for reading.

May the sort be with you.