The sort command sorts the lines of text files. For example, if we have a file called names, we can sort it as follows:
$ cat names
John Doe
Jane Doe
John Roe
Richard Roe
Tommy Atkins
Max Mustermann
Erika Mustermann
Joe Bloggs
$
$ sort names
Erika Mustermann
Jane Doe
Joe Bloggs
John Doe
John Roe
Max Mustermann
Richard Roe
Tommy Atkins
The words in each input line are fields, which are numbered from 1. So, to sort on last names, we sort the above file on the second field.
$ sort -k2,2 names
Tommy Atkins
Joe Bloggs
Jane Doe
John Doe
Erika Mustermann
Max Mustermann
John Roe
Richard Roe
2.0 Sort order
Let's look at another example, the famous quotations by Lewis Carroll, and the sorted output.
$ cat rquote
1.
"The time has come,"
the Walrus said,
"To talk of many things:
Of shoes--and ships--
and sealing wax--
Of cabbages--and kings."
Lewis Carroll
2.
"The White Rabbit put on his spectacles.
'Where shall I begin, please your Majesty?' he asked.
'Begin at the beginning,' the King said gravely,
'and go on till you come to the end: then stop.'"
by Lewis Carroll: Alice in Wonderland
$
$ sort rquote
1.
2.
'and go on till you come to the end: then stop.'"
and sealing wax--
'Begin at the beginning,' the King said gravely,
by Lewis Carroll: Alice in Wonderland
Lewis Carroll
Of cabbages--and kings."
Of shoes--and ships--
"The time has come,"
the Walrus said,
"The White Rabbit put on his spectacles.
"To talk of many things:
'Where shall I begin, please your Majesty?' he asked.
The output is not quite as expected: looking at the first letter of each line, we have lowercase, followed by uppercase, and then lowercase again. We can fix this by setting the environment variable LC_ALL=C. Strictly speaking, the comparisons follow the collating sequence specified by LC_COLLATE, but LC_ALL overrides LC_COLLATE, so it is better to set LC_ALL to C.
$ LC_ALL=C
$ export LC_ALL
$ sort rquote
Lewis Carroll
by Lewis Carroll: Alice in Wonderland
"The White Rabbit put on his spectacles.
"The time has come,"
"To talk of many things:
'Begin at the beginning,' the King said gravely,
'Where shall I begin, please your Majesty?' he asked.
'and go on till you come to the end: then stop.'"
1.
2.
Of cabbages--and kings."
Of shoes--and ships--
and sealing wax--
the Walrus said,
Looking at the first character of each line, the spaces come first (there is a space at the beginning of the lines containing the author's name). Next come the double quotes, followed by the single quotes and then the digits. After that, we have the uppercase letters, followed by the lowercase letters. This sequence matches the order of the ASCII character set.
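Exporting LC_ALL changes the locale for every later command in the shell session. As a minimal sketch of a narrower alternative, assuming a POSIX shell, the assignment can be placed in front of the command so that it applies to that one sort only. The file name letters.txt and its contents are made up for this illustration.

```shell
# Create a small sample file (hypothetical name, illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' 'banana' 'Apple' 'cherry' '"quoted"' '1 first' > "$tmpdir/letters.txt"

# The assignment applies to this one sort command only;
# the locale of the shell itself is left unchanged.
c_sorted=$(LC_ALL=C sort "$tmpdir/letters.txt")
printf '%s\n' "$c_sorted"

rm -r "$tmpdir"
```

The lines should come out in ASCII order: the double quote first, then the digit, then uppercase, then lowercase.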
3.0 Sort in reverse order
The -r option reverses the sort order so that bigger key values appear earlier in output. For example, sort -r -k2,2 names sorts names in the reverse order of last names.
$ sort -r -k2,2 names
Richard Roe
John Roe
Max Mustermann
Erika Mustermann
John Doe
Jane Doe
Joe Bloggs
Tommy Atkins
4.0 Sort numerically
The -n option sorts based on the numeric value of strings. For example,
$ cat attendance
John Doe 12
Jane Doe 5
John Roe 25
Richard Roe 3
Tommy Atkins 14
Max Mustermann 2
Erika Mustermann 24
Joe Bloggs 7
$
$ sort -n -k3,3 attendance
Max Mustermann 2
Richard Roe 3
Jane Doe 5
Joe Bloggs 7
John Doe 12
Tommy Atkins 14
Erika Mustermann 24
John Roe 25
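To see why -n matters, here is a small sketch that sorts the same numbers with and without it. The sample numbers are taken from the attendance file above; the variable names are just for illustration.

```shell
# Without -n, keys are compared character by character,
# so "12" and "25" sort before "3" and "5".
lexical=$(printf '%s\n' 12 5 25 3 | LC_ALL=C sort)

# With -n, keys are compared by their numeric value.
numeric=$(printf '%s\n' 12 5 25 3 | LC_ALL=C sort -n)

printf 'lexical:\n%s\n' "$lexical"
printf 'numeric:\n%s\n' "$numeric"
```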
5.0 Sort in reverse numeric order
We can combine the -r option with the -n option to sort in reverse numeric order.
$ sort -nr -k3,3 attendance
John Roe 25
Erika Mustermann 24
Tommy Atkins 14
John Doe 12
Joe Bloggs 7
Jane Doe 5
Richard Roe 3
Max Mustermann 2
6.0 Sort folding lowercase to upper case
The -f option does a case-insensitive sort, folding lowercase letters to uppercase and treating the two as the same. For example, if some names had been typed in lowercase, we can still get the correct sort order using the -f option.
$ cat names
John Doe
jane doe
John Roe
richard roe
Tommy Atkins
Max Mustermann
erika mustermann
Joe Bloggs
$
$ sort -f -k2,2 names
Tommy Atkins
Joe Bloggs
John Doe
jane doe
Max Mustermann
erika mustermann
John Roe
richard roe
7.0 Sort based on key
You can sort the lines based on one or more keys by giving the option -k POS1[,POS2] once for each key. POS1 and POS2 are the start and end positions of the key. If POS2 is omitted, the key runs from POS1 to the end of the line.
Each POS is defined as F[.C][OPTS]. F is the field number; field numbers start with 1. C is the start or end character position of the key inside the field. Character positions start with 1, which is the default value for the start position. The default value of C for the end position is the end of the field.
OPTS lets you apply the sort command's ordering options to an individual key, using one or more characters from the set { b d f g h i M n R r V }. Suppose we wish to sort names first on the last name and then on the first name:
$ sort -k2,2 -k1,1 names
Tommy Atkins
Joe Bloggs
Jane Doe
John Doe
Erika Mustermann
Max Mustermann
John Roe
Richard Roe
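The effect of leaving out POS2 is easy to miss, so here is a minimal sketch comparing -k2,2 with -k2. The two sample lines are invented for the illustration and have a third field so that the difference shows up.

```shell
# Two made-up lines with the same second field but different third fields.
sample=$(printf '%s\n' 'A Doe 9' 'B Doe 1')

# -k2,2: the key is field 2 only. Both keys are equal, so sort
# falls back to comparing the whole lines, and "A ..." comes first.
field_only=$(printf '%s\n' "$sample" | LC_ALL=C sort -k2,2)

# -k2: the key runs from field 2 to the end of the line,
# so the trailing digit takes part and "B Doe 1" comes first.
to_line_end=$(printf '%s\n' "$sample" | LC_ALL=C sort -k2)

printf -- '-k2,2:\n%s\n' "$field_only"
printf -- '-k2:\n%s\n' "$to_line_end"
```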
As another example, consider sorting on the last name, then the first name, and finally the marks, a numeric key, in reverse order.
$ cat class
AC12 John Doe 112 Science
AC12 John Doe 132 Mathematics
RA11 Jane Doe 25 Art
RA11 Jane Doe 171 Craft
AP12 John Roe 123 Literature
AL14 Richard Roe 43 Language
AL14 Richard Roe 123 Literature
YM17 Tommy Atkins 126 Mathematics
AS12 Max Mustermann 121 Geography
PT14 Erika Mustermann 181 History
DE12 Joe Bloggs 171 Social Studies
$
$ sort -k3,3 -k2,2 -k4,4rn class
YM17 Tommy Atkins 126 Mathematics
DE12 Joe Bloggs 171 Social Studies
RA11 Jane Doe 171 Craft
RA11 Jane Doe 25 Art
AC12 John Doe 132 Mathematics
AC12 John Doe 112 Science
PT14 Erika Mustermann 181 History
AS12 Max Mustermann 121 Geography
AP12 John Roe 123 Literature
AL14 Richard Roe 123 Literature
AL14 Richard Roe 43 Language
As another example, consider the case where there are a bunch of log files named log1.gz, log2.gz, ..., log100.gz, ..., log200.gz, and we want the directory listing sorted by number.
$ ls
log101.gz log102.gz log103.gz log104.gz log105.gz log106.gz log10.gz log1.gz log200.gz log20.gz
$ ls | sort -t . -n -k1.4
log1.gz
log10.gz
log20.gz
log101.gz
log102.gz
log103.gz
log104.gz
log105.gz
log106.gz
log200.gz
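We define the field separator as the period (.). The key starts at the fourth character of the first field and, since no end position is given, runs to the end of the line. With -n, only the leading number in the key is compared, which gives the proper sort order.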
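As a variation on the command above, the key can be bounded explicitly and the numeric option attached to the key itself rather than given globally. This is only a sketch using the POS1,POS2 and OPTS syntax described earlier; the file names are fed in with printf instead of ls so that the example does not depend on the current directory.

```shell
# -t .      : fields are separated by a period
# -k1.4,1n  : key starts at character 4 of field 1, ends at the end of
#             field 1, and is compared numerically (the trailing n)
sorted=$(printf '%s\n' log101.gz log10.gz log1.gz log200.gz log20.gz |
    LC_ALL=C sort -t . -k1.4,1n)

printf '%s\n' "$sorted"
```

Here the keys are just the digits 101, 10, 1, 200 and 20, so nothing after the first period is looked at when the keys are compared.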
8.0 Sort using a different field separator
By default, fields are separated at the transition from a non-blank character to a blank character. We can change this with the -t option. For example, we can get a sorted list of users from the /etc/passwd file, where the fields are separated by colons.
$ sort -t : -k1,1 /etc/passwd | awk -F: '{ print $1 }'
avahi
avahi-autoipd
backup
bin
colord
daemon
...
If the field separator is a tab, special syntax is required to specify the delimiter. In bash, ANSI-C quoting lets us write the tab as $'\t'. Suppose the first and last names in names are separated by a tab; we can then sort on the last name as shown below.
$ cat names
John Doe
Jane Doe
John Roe
Richard Roe
Tommy Atkins
Max Mustermann
Erika Mustermann
Joe Bloggs
$ sort -t $'\t' -k2,2 names
Tommy Atkins
Joe Bloggs
Jane Doe
John Doe
Erika Mustermann
Max Mustermann
John Roe
Richard Roe
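The $'\t' form is a bash feature and may not be available in every shell. A possible alternative, sketched here under the assumption of a plain POSIX shell, is to let printf produce the tab and store it in a variable. The sample data, including the two-word first name, is invented for the illustration.

```shell
# Produce a literal tab character without relying on $'...' quoting.
tab=$(printf '\t')

tmpfile=$(mktemp)
# First and last names separated by a tab; one first name contains a space.
printf 'John\tDoe\nMary Ann\tAtkins\nJoe\tBloggs\n' > "$tmpfile"

# With the tab as separator, "Mary Ann" is a single field,
# and field 2 is the last name on every line.
by_last=$(LC_ALL=C sort -t "$tab" -k2,2 "$tmpfile")
printf '%s\n' "$by_last"

rm "$tmpfile"
```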
9.0 Check if file is sorted
You can quickly check whether a file is already sorted using the -c or the (uppercase) -C option. Both set the exit status to 0 if the file is sorted and to 1 if it is not. The -c option also prints the first out-of-order line. The -C option checks silently and prints no diagnostic. For example,
$ sort -c names
sort: names:2: disorder: Jane Doe
$ echo $?
1
$ sort -C names
$ echo $?
1
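Because -C reports only through the exit status, it fits naturally into a shell condition. The following is a minimal sketch; the helper function is_sorted and the sample data are hypothetical and not part of sort itself.

```shell
tmpfile=$(mktemp)
printf '%s\n' 'John Doe' 'Jane Doe' > "$tmpfile"

# Hypothetical helper: report whether a file is in sort order.
# sort -C prints nothing; only its exit status is used here.
is_sorted() {
    if LC_ALL=C sort -C "$1"; then
        echo 'sorted'
    else
        echo 'not sorted'
    fi
}

before=$(is_sorted "$tmpfile")

# Write a sorted copy and check that one too.
LC_ALL=C sort "$tmpfile" > "$tmpfile.sorted"
after=$(is_sorted "$tmpfile.sorted")

printf 'before: %s, after: %s\n' "$before" "$after"
rm "$tmpfile" "$tmpfile.sorted"
```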
10.0 Output unique keys
With the -u option, you get lines with unique keys only in the output. If multiple lines have the same keys, only the first one occurring in the input is written to the output; the rest are discarded. For example,
$ cat xnames
Jame Doe
John Doe
Jane Doe
John Roe
Richard Roe
Tommy Atkins
Max Mustermann
Erika Mustermann
Joe Bloggs
$
$ sort -u -k2,2 xnames
Tommy Atkins
Joe Bloggs
Jame Doe
Max Mustermann
John Roe
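What counts as a duplicate depends on the key. As a rough sketch with invented sample data, the first command below uses no -k option, so the whole line is the key, while the second restricts the key to the last name.

```shell
sample=$(printf '%s\n' 'John Doe' 'Jane Doe' 'John Doe' 'Joe Bloggs')

# No -k: the whole line is the key, so only the exact
# duplicate of "John Doe" is dropped.
unique_lines=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)

# Key restricted to field 2: a single line survives for each last name.
unique_last=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -k2,2)

printf 'whole line as key:\n%s\n' "$unique_lines"
printf 'last name as key:\n%s\n' "$unique_last"
```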
11.0 Merge Files
With the -m option, we can merge already sorted files. For example, if names and names1 are already sorted, we can merge them as shown below:
$ cat names
Erika Mustermann
Jane Doe
Joe Bloggs
John Doe
John Roe
Max Mustermann
Richard Roe
Tommy Atkins
$
$ cat names1
Erica Mustermann
Jack Doe
Janney Doe
John Bloggs
Johnny Roe
Ray Mustermann
Richie Roe
Thomas Atkins
$
$ sort -m names names1
Erica Mustermann
Erika Mustermann
Jack Doe
Jane Doe
Janney Doe
Joe Bloggs
John Bloggs
John Doe
Johnny Roe
John Roe
Max Mustermann
Ray Mustermann
Richard Roe
Richie Roe
Thomas Atkins
Tommy Atkins
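Merging only works if the inputs are already sorted in the same order that sort itself would use. The sketch below, with two small made-up files in C-locale order, compares the result of -m with a full sort of the concatenated input; the two should agree.

```shell
tmpdir=$(mktemp -d)
# Both sample files are already in C-locale order.
printf '%s\n' 'Jane Doe' 'John Roe' 'Tommy Atkins' > "$tmpdir/a"
printf '%s\n' 'Jack Doe' 'Richie Roe' > "$tmpdir/b"

# -m only interleaves the inputs; it does not re-sort them.
merged=$(LC_ALL=C sort -m "$tmpdir/a" "$tmpdir/b")

# A full sort of the concatenated input, for comparison.
resorted=$(cat "$tmpdir/a" "$tmpdir/b" | LC_ALL=C sort)

printf '%s\n' "$merged"
rm -r "$tmpdir"
```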
12.0 Sort Options
The options for the sort command are summarized in a table below:
Option | Description |
---|---|
-b | Ignore leading blanks. |
-d | Dictionary order. Only blanks and alphanumeric characters are considered. |
-f | Ignore case. Fold lowercase to uppercase characters. |
-g | General numeric sort. Converts numbers to floating point for comparison. It is slower than the -n option, so use it only when -n cannot handle the input (for example, numbers in scientific notation). |
-h | Human numeric sort. Sort first by sign, then SI suffix (blank, k, K or one of 'MGTPEZY') and finally by the numeric value. |
-i | Ignore non-printing characters. Sort considering only printable characters. |
-M | Month sort, where JAN < FEB < ... < DEC. |
-n | Numeric sort. Sort considering the numeric value of strings. |
-R | Random sort. Use a random hash function for input keys and then sort the hash values. |
-r | Reverse the result of comparison. The greater key values come before the smaller ones. |
-V | Version sort. Runs of digits embedded in the text are compared by numeric value, so that, for example, file-1.9 sorts before file-1.10. |
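
Two of the options in the table that are not demonstrated earlier can be tried with a short sketch. Note that -h and -M are extensions found in GNU sort and some other implementations, not POSIX options, so this assumes such a sort is installed. The sample values are invented.

```shell
# -h: compare human-readable sizes by their suffix first, then by value.
by_size=$(printf '%s\n' 2K 1G 512 3M | LC_ALL=C sort -h)

# -M: compare month name abbreviations in calendar order.
by_month=$(printf '%s\n' MAR JAN FEB | LC_ALL=C sort -M)

printf '%s\n' "$by_size" "$by_month"
```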