Monday, 6 August 2018

Sort command in Linux

The sort command sorts the lines of text files. For example, if we have a file called names, we can sort it as follows:

    $ cat names
    John Doe
    Jane Doe
    John Roe
    Richard Roe
    Tommy Atkins
    Max Mustermann
    Erika Mustermann
    Joe Bloggs
    $
    $ sort names
    Erika Mustermann
    Jane Doe
    Joe Bloggs
    John Doe
    John Roe
    Max Mustermann
    Richard Roe
    Tommy Atkins

The words in the input lines are fields, which are numbered 1 onwards. So if we want to sort based on last names, we can sort the above file on the second field.

    $ sort -k2,2 names
    Tommy Atkins
    Joe Bloggs
    Jane Doe
    John Doe
    Erika Mustermann
    Max Mustermann
    John Roe
    Richard Roe
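
In the output above, Jane Doe comes before John Doe even though both have the same key. A minimal sketch of why, assuming a sort that supports the -s (stable) option, which GNU and BSD sort provide but POSIX does not require: when keys tie, sort falls back to comparing the whole line, and -s turns that fallback off. The two-line input is made up for illustration, and LC_ALL=C is set only to keep the result independent of the locale.

```shell
# Two lines whose second field (the last name) is identical.
input=$(printf 'John Doe\nJane Doe\n')

# Default: ties on the key fall back to comparing the whole line.
default_order=$(printf '%s\n' "$input" | LC_ALL=C sort -k2,2)

# -s (stable): ties keep their input order instead.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k2,2)

printf 'default:\n%s\nstable:\n%s\n' "$default_order" "$stable_order"
```

With the default fallback the lines come out as Jane Doe, John Doe; with -s they stay in input order.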

2.0 Sort order

Let's look at another example, the famous quotations by Lewis Carroll, and the sorted output.

    $ cat rquote
    1.
    "The time has come,"
    the Walrus said,
    "To talk of many things:
    Of shoes--and ships--
    and sealing wax--
    Of cabbages--and kings."
     Lewis Carroll
    2.
    "The White Rabbit put on his spectacles.
    'Where shall I begin, please your Majesty?' he asked.
    'Begin at the beginning,' the King said gravely,
    'and go on till you come to the end: then stop.'"
     by Lewis Carroll: Alice in Wonderland
    $
    $ sort rquote
    1.
    2.
    'and go on till you come to the end: then stop.'"
    and sealing wax--
    'Begin at the beginning,' the King said gravely,
     by Lewis Carroll: Alice in Wonderland
     Lewis Carroll
    Of cabbages--and kings."
    Of shoes--and ships--
    "The time has come,"
    the Walrus said,
    "The White Rabbit put on his spectacles.
    "To talk of many things:
    'Where shall I begin, please your Majesty?' he asked.

The output is not quite as expected: reading down the first characters of the lines, we see lowercase, then uppercase, and then lowercase again. This happens because comparisons follow the collating sequence of the current locale, which is specified by LC_COLLATE. We can get a plain byte-order sort by setting the environment variable LC_ALL=C. LC_ALL overrides LC_COLLATE, so it is the safer of the two to set.

    $ LC_ALL=C
    $ export LC_ALL
    $ sort rquote
     Lewis Carroll
     by Lewis Carroll: Alice in Wonderland
    "The White Rabbit put on his spectacles.
    "The time has come,"
    "To talk of many things:
    'Begin at the beginning,' the King said gravely,
    'Where shall I begin, please your Majesty?' he asked.
    'and go on till you come to the end: then stop.'"
    1.
    2.
    Of cabbages--and kings."
    Of shoes--and ships--
    and sealing wax--
    the Walrus said,

Looking at the first character of each line, the spaces come first (there is a space at the beginning of the lines containing the author's name). Next come the double quotes, followed by the single quotes and then the digits. After that, we have the uppercase letters followed by the lowercase letters. This sequence matches the order of the ASCII character set.
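
Exporting LC_ALL changes the locale for the rest of the shell session. As a small sketch, the variable can instead be set for a single command by writing the assignment in front of it. The four sample lines are invented for illustration; they start with a lowercase letter, an uppercase letter, a digit and a double quote.

```shell
# Sample lines with different kinds of first character.
sample=$(printf 'apple\nBanana\n1. first\n"quoted"\n')

# LC_ALL=C applies to this one sort command only; the shell's own locale is untouched.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)

printf '%s\n' "$c_sorted"
```

The lines come out in ASCII order: the double quote, the digit, the uppercase letter and then the lowercase letter.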

3.0 Sort in reverse order

The -r option reverses the sort order so that bigger key values appear earlier in output. For example, sort -r -k2,2 names sorts names in the reverse order of last names.

    $ sort -r -k2,2 names
    Richard Roe
    John Roe
    Max Mustermann
    Erika Mustermann
    John Doe
    Jane Doe
    Joe Bloggs
    Tommy Atkins

4.0 Sort numerically

The -n option sorts based on the numeric value of strings. For example,

    $ cat attendance
    John Doe 12
    Jane Doe 5
    John Roe 25
    Richard Roe 3
    Tommy Atkins 14
    Max Mustermann 2
    Erika Mustermann 24
    Joe Bloggs 7
    $
    $ sort -n -k3,3 attendance
    Max Mustermann 2
    Richard Roe 3
    Jane Doe 5
    Joe Bloggs 7
    John Doe 12
    Tommy Atkins 14
    Erika Mustermann 24
    John Roe 25
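
To see what -n changes, here is a short sketch that sorts the same four numbers (taken from the attendance file) with and without the option. Without -n the values are compared as strings, character by character, so 12 sorts before 3.

```shell
counts=$(printf '12\n5\n25\n3\n')

# String comparison: "1" < "2" < "3" < "5" decides the order.
lexical=$(printf '%s\n' "$counts" | LC_ALL=C sort)

# Numeric comparison: the value of each number decides the order.
numeric=$(printf '%s\n' "$counts" | LC_ALL=C sort -n)

printf 'lexical: %s\n' "$(printf '%s' "$lexical" | tr '\n' ' ')"
printf 'numeric: %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"
```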

5.0 Sort in reverse numeric order

We can combine the -r option with the -n option to sort in reverse numeric order.

    $ sort -nr -k3,3 attendance
    John Roe 25
    Erika Mustermann 24
    Tommy Atkins 14
    John Doe 12
    Joe Bloggs 7
    Jane Doe 5
    Richard Roe 3
    Max Mustermann 2

6.0 Sort folding lowercase to upper case

The -f option does a case insensitive sort, folding lowercase letters to uppercase and treating the two as the same. For example, if some names had been typed in lowercase, we can still get the correct sort order using the -f option.

    $ cat names
    John Doe
    jane doe
    John Roe
    richard roe
    Tommy Atkins
    Max Mustermann
    erika mustermann
    Joe Bloggs
    $
    $ sort -f -k2,2 names
    Tommy Atkins
    Joe Bloggs
    John Doe
    jane doe
    Max Mustermann
    erika mustermann
    John Roe
    richard roe
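
The effect of -f is easiest to see in the C locale, where every uppercase letter sorts before every lowercase one. The sketch below uses three made-up lines; without -f the lowercase "doe" lands after "Roe", and with -f it moves back between "Atkins" and "Roe".

```shell
mixed=$(printf 'John Roe\njane doe\nTommy Atkins\n')

# Byte order: "A" < "R" < "d", so the lowercase last name sorts last.
case_sensitive=$(printf '%s\n' "$mixed" | LC_ALL=C sort -k2,2)

# -f folds lowercase to uppercase before comparing: ATKINS < DOE < ROE.
case_folded=$(printf '%s\n' "$mixed" | LC_ALL=C sort -f -k2,2)

printf 'without -f:\n%s\nwith -f:\n%s\n' "$case_sensitive" "$case_folded"
```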

7.0 Sort based on key

You can sort lines on one or more keys by giving the option -k POS1[,POS2] once for each key. POS1 and POS2 are the start and end positions of the key. If POS2 is omitted, the key runs from POS1 to the end of the line. Each POS has the form F[.C][OPTS]. F is the field number; field numbers start at 1. C is the character position inside the field, also counted from 1. For the start position, C defaults to 1; for the end position, C defaults to the end of the field. OPTS is one or more letters from the set { b d f g h i M n R r V }, which apply the corresponding sort options to that key alone. Suppose we wish to sort names first on the last name and then on the first name:

    $ sort -k2,2 -k1,1 names
    Tommy Atkins
    Joe Bloggs
    Jane Doe
    John Doe
    Erika Mustermann
    Max Mustermann
    John Roe
    Richard Roe

As another example, consider sorting on last name, first name and the numeric key, marks, in the reverse order.

    $ cat class
    AC12 John Doe 112 Science
    AC12 John Doe 132 Mathematics
    RA11 Jane Doe 25 Art
    RA11 Jane Doe 171 Craft
    AP12 John Roe 123 Literature
    AL14 Richard Roe 43 Language
    AL14 Richard Roe 123 Literature
    YM17 Tommy Atkins 126 Mathematics
    AS12 Max Mustermann 121 Geography
    PT14 Erika Mustermann 181 History
    DE12 Joe Bloggs 171 Social Studies
    $
    $ sort -k3,3 -k2,2 -k4,4rn class
    YM17 Tommy Atkins 126 Mathematics
    DE12 Joe Bloggs 171 Social Studies
    RA11 Jane Doe 171 Craft
    RA11 Jane Doe 25 Art
    AC12 John Doe 132 Mathematics
    AC12 John Doe 112 Science
    PT14 Erika Mustermann 181 History
    AS12 Max Mustermann 121 Geography
    AP12 John Roe 123 Literature
    AL14 Richard Roe 123 Literature
    AL14 Richard Roe 43 Language

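
The rn at the end of -k4,4rn applies only to that key; the name keys are still compared as ordinary strings in ascending order. A smaller sketch of the same idea, using a made-up three-column input (first name, last name, marks) rather than the class file:

```shell
marks=$(printf 'Jane Doe 25\nJohn Doe 112\nJane Doe 171\n')

# Last name, then first name, both ascending; marks numeric and reversed.
ranked=$(printf '%s\n' "$marks" | LC_ALL=C sort -k2,2 -k1,1 -k3,3rn)

printf '%s\n' "$ranked"
```

Both Jane Doe lines stay ahead of John Doe, and within Jane Doe the higher mark comes first.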
As another example, suppose there are a number of log files named log1.gz, log2.gz, ..., log100.gz, ..., log200.gz, and we want the directory listing sorted by log number.

    $ ls
    log101.gz
    log102.gz
    log103.gz
    log104.gz
    log105.gz
    log106.gz
    log10.gz
    log1.gz
    log200.gz
    log20.gz
    $ ls | sort -t . -n -k1.4
    log1.gz
    log10.gz
    log20.gz
    log101.gz
    log102.gz
    log103.gz
    log104.gz
    log105.gz
    log106.gz
    log200.gz

We define the field separator as the period (.). The key starts at the fourth character of the first field, just past the prefix log, and is compared numerically. This gives the proper sort order.

8.0 Sort using a different field separator

The default field separator is the transition from a non-blank character to a blank character. We can change this with the -t option. For example, to get a sorted list of users from the /etc/passwd file, where the fields are separated by colons:

    $ sort -t : -k1,1 /etc/passwd | awk -F: '{ print $1 }'
    avahi
    avahi-autoipd
    backup
    bin
    colord
    daemon
    ...

If the field separator is a tab, the delimiter needs special syntax, such as Bash's ANSI-C quoting ($'\t'). Suppose the first and last names in names are separated by a tab; we can sort on the last name as shown below.

    $ cat names
    John	Doe
    Jane	Doe
    John	Roe
    Richard	Roe
    Tommy	Atkins
    Max	Mustermann
    Erika	Mustermann
    Joe	Bloggs
    $ sort -t $'\t' -k2,2 names
    Tommy	Atkins
    Joe	Bloggs
    Jane	Doe
    John	Doe
    Erika	Mustermann
    Max	Mustermann
    John	Roe
    Richard	Roe
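
The $'\t' form is a feature of Bash and similar shells. For a plain POSIX shell, one possible alternative is to let printf produce the tab and keep it in a variable. This is a sketch; the variable name tab and the three sample lines are chosen for illustration.

```shell
# Hold a literal tab character in a variable.
tab=$(printf '\t')

# Tab-separated first and last names.
tsv=$(printf 'John\tRoe\nJane\tDoe\nTommy\tAtkins\n')

# Pass the tab to -t and sort on the second field, the last name.
by_last=$(printf '%s\n' "$tsv" | LC_ALL=C sort -t "$tab" -k2,2)

printf '%s\n' "$by_last"
```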

9.0 Check if file is sorted

You can quickly check whether a file is already sorted using the -c or the (uppercase) -C option. Both return an exit status of 0 if the file is sorted and 1 if it is not. The -c option also prints a diagnostic naming the first out-of-order line, whereas -C checks silently. For example, with an unsorted names file,

    $ sort -c names
    sort: names:2: disorder: Jane Doe
    $ echo $?
    1
    $ sort -C names
    $ echo $?
    1
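
Because -C reports its result only through the exit status, it fits naturally into an if statement. The sketch below checks a scratch file, sorts it in place with the -o option (which writes the output to the named file and may name the input file itself), and checks again. The temporary file and the variable names are made up for illustration.

```shell
# Scratch file with two lines in the wrong order.
tmp=$(mktemp)
printf 'John Doe\nJane Doe\n' > "$tmp"

# sort -C prints nothing; the exit status alone says whether the file is sorted.
if LC_ALL=C sort -C "$tmp"; then before="sorted"; else before="not sorted"; fi

# Sort the file in place: -o names the output file.
LC_ALL=C sort -o "$tmp" "$tmp"

if LC_ALL=C sort -C "$tmp"; then after="sorted"; else after="not sorted"; fi

printf 'before: %s\nafter: %s\n' "$before" "$after"
rm -f "$tmp"
```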

10.0 Output unique keys

With the -u option, you get lines with unique keys only in the output. If multiple lines have the same keys, only the first one occurring in the input is written to the output; the rest are discarded. For example,

    $ cat xnames
    Jame Doe
    John Doe
    Jane Doe
    John Roe
    Richard Roe
    Tommy Atkins
    Max Mustermann
    Erika Mustermann
    Joe Bloggs
    $
    $ sort -u -k2,2 xnames
    Tommy Atkins
    Joe Bloggs
    Jame Doe
    Max Mustermann
    John Roe
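
Uniqueness is judged on the sort key, not on the whole line. As a rough illustration using the first four lines of xnames, the sketch below counts the output lines with and without a key: all four lines are distinct as whole lines, but they share only two last names.

```shell
sample=$(printf 'Jame Doe\nJohn Doe\nJane Doe\nJohn Roe\n')

# No -k: the whole line is the key, so every line is unique.
unique_lines=$(printf '%s\n' "$sample" | LC_ALL=C sort -u | wc -l | tr -d ' ')

# Key is the last name: one line survives per distinct last name.
unique_keys=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -k2,2 | wc -l | tr -d ' ')

printf 'unique lines: %s\nunique last names: %s\n' "$unique_lines" "$unique_keys"
```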

11.0 Merge Files

With the -m option, we can merge already sorted files. For example, if names and names1 are already sorted, we can merge them as shown below:

    $ cat names
    Erika Mustermann
    Jane Doe
    Joe Bloggs
    John Doe
    John Roe
    Max Mustermann
    Richard Roe
    Tommy Atkins
    $
    $ cat names1
    Erica Mustermann
    Jack Doe
    Janney Doe
    John Bloggs
    Johnny Roe
    Ray Mustermann
    Richie Roe
    Thomas Atkins
    $
    $ sort -m names names1
    Erica Mustermann
    Erika Mustermann
    Jack Doe
    Jane Doe
    Janney Doe
    Joe Bloggs
    John Bloggs
    John Doe
    Johnny Roe
    John Roe
    Max Mustermann
    Ray Mustermann
    Richard Roe
    Richie Roe
    Thomas Atkins
    Tommy Atkins
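
A self-contained sketch of the same merge on two small files. The directory and the file names a and b are hypothetical, and each file is written already sorted because -m assumes sorted input and does not sort the files itself.

```shell
# Scratch directory holding two already-sorted files.
dir=$(mktemp -d)
printf 'Jane Doe\nJohn Roe\n' > "$dir/a"
printf 'Jack Doe\nRichie Roe\n' > "$dir/b"

# -m interleaves the two sorted inputs into one sorted stream.
merged=$(LC_ALL=C sort -m "$dir/a" "$dir/b")

printf '%s\n' "$merged"
rm -rf "$dir"
```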

12.0 Sort Options

The options for the sort command are summarized in the table below:

sort options

Option  Description
-b      Ignore leading blanks.
-d      Dictionary order. Only blanks and alphanumeric characters are considered.
-f      Ignore case. Fold lowercase to uppercase characters.
-g      General numeric sort. Converts numbers to floating point for comparison. Not recommended as it is slower than the -n option.
-h      Human numeric sort. Sort first by sign, then SI suffix (blank, k, K or one of 'MGTPEZY') and finally by the numeric value.
-i      Ignore non-printing characters. Sort considering only printable characters.
-M      Month sort, where JAN < FEB < ... < DEC.
-n      Numeric sort. Sort considering the numeric value of strings.
-R      Random sort. Use a random hash function for input keys and then sort the hash values.
-r      Reverse the result of comparison. The greater key values come before the smaller ones.
-V      Version sort. A natural sort of the version numbers embedded in the text, so that runs of digits are compared by value.
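
Two of the less familiar options, sketched on made-up input. Both -h and -V are extensions found in GNU sort (and some BSD versions) rather than in POSIX, so this assumes such a sort is installed. The log file names are borrowed from the earlier listing.

```shell
# -h: values without a suffix come first, then K, then M; ties broken by number.
sizes=$(printf '1K\n512\n2M\n10K\n' | LC_ALL=C sort -h)

# -V: digit runs inside the names are compared by value, so log1 < log10 < log20 < log200.
logs=$(printf 'log10.gz\nlog1.gz\nlog200.gz\nlog20.gz\n' | LC_ALL=C sort -V)

printf 'sizes:\n%s\nlogs:\n%s\n' "$sizes" "$logs"
```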
