Monday, 6 August 2018

Awk Tutorial

awk is a filter: it reads input, matches each line against the patterns in a program, runs the processing associated with each matched pattern, and writes the result. awk is well suited for processing large volumes of raw text data and producing statistical reports from it.

awk is named after the initials of the surnames of its designers, Alfred Aho, Peter Weinberger and Brian Kernighan, who developed it for Unix systems in the 1970s. Since then, many versions of awk have been released; prominent among them is gawk, which is available on present-day GNU/Linux distributions. On many Linux systems, awk is a symbolic link to gawk.

2.0 awk Command syntax

awk 'program' filenames ...
where program is
pattern { action }
pattern { action }
...
The pattern may be a regular expression as in grep or sed, or it can be a C language style conditional expression, like length ($2) > 12. If no pattern is specified, it means match all lines.
The action is a list of awk statements. If the action part is missing, the default action is to print the matched line(s). Either the pattern or the action may be omitted, but not both. Also, the opening brace of the action must be on the same line as its pattern (if there is a pattern for that action). If the line containing the pattern becomes too long, it can be continued on the next line by putting a backslash (\) at the end of the line.
awk reads input one line at a time. Each line of input is considered a record. For each line read, it scans each pattern. If a pattern matches the line, the associated action statements are run on that line of input.
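The three forms of a pattern { action } rule can be tried on a few lines of inline data. This is a minimal sketch; the sample data and the shell variable names (data, p_only, a_only, both) are made up for illustration.

```shell
data='alpha 1
beta 22
gamma 333'

# Pattern only: the default action prints each matching line.
p_only=$(printf '%s\n' "$data" | awk '/^b/')
printf '%s\n' "$p_only"      # beta 22

# Action only: the action runs for every line.
a_only=$(printf '%s\n' "$data" | awk '{ print $2 }')
printf '%s\n' "$a_only"      # 1, 22 and 333 on separate lines

# Pattern and action: print the first field where the second is longer than one character.
both=$(printf '%s\n' "$data" | awk 'length($2) > 1 { print $1 }')
printf '%s\n' "$both"        # beta and gamma on separate lines
```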

3.0 Fields

awk breaks each line of input into fields numbered $1, $2, ..., $NF. A field is a word, that is, a sequence of non-blank characters. The default field separator is one or more spaces and tabs. The built-in variable NF contains the number of fields in an input line. $0 is the entire line of input. Suppose we wish to know the user id, process id, CPU time and command name of the top five processes by CPU usage. We can print the first, second, seventh and eighth fields of the ps command output lines.
$ ps -ef --no-headers --sort -pcpu | sed '5q' | awk ' { print $1, $2, $7, $8 }'
user1 2728 00:39:20 /opt/google/chrome/chrome
user1 2850 00:18:25 /opt/google/chrome/chrome
user1 2503 00:07:05 /opt/google/chrome/chrome
user1 2785 00:03:07 /opt/google/chrome/chrome
user1 2071 00:02:51 compiz
A blank line has no fields, so its NF is zero. The following command prints a file with its blank lines removed (the file itself is not modified).
$ awk 'NF > 0 { print }' filename
The default field separator is whitespace (spaces and tabs), but with the -F option we can set it to another character or, more generally, to a regular expression. For example, to list the user names from the /etc/passwd file, we set the field separator to the colon character,
$ awk -F: '{ print $1}' /etc/passwd
root
daemon
bin
sys
sync
...
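As a small sketch of how NF, $NF and -F behave, the commands below split one passwd-style line on colons and one line on default whitespace. The sample lines and the variable names (line, colon, blank) are assumptions made for this example.

```shell
line='root:x:0:0:root:/root:/bin/bash'

# With -F: the line has seven fields; $NF is the last one.
colon=$(printf '%s\n' "$line" | awk -F: '{ print NF, $1, $NF }')
printf '%s\n' "$colon"       # 7 root /bin/bash

# With the default separator, runs of spaces and tabs count as one
# separator and leading or trailing blanks are ignored.
blank=$(printf '  a   b\tc \n' | awk '{ print NF, $2 }')
printf '%s\n' "$blank"       # 3 b
```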

4.0 awk Built-in Variables

awk provides built-in variables which are initialized and updated by awk. These can be read and changed in the program.
awk - Built-in Variables (subset)

Variable                  Description
FILENAME                  The file being processed currently by awk.
FS                        Field separator. The default is a single space, which means fields are separated by runs of spaces and tabs.
NF                        Number of fields in an input line (record).
NR                        Number of input records (lines) read so far.
FNR                       Input record number within the current file.
RS                        Input record separator. The default value is newline.
OFS                       Output field separator. The default value is a single space.
ORS                       Output record separator. The default value is newline.
OFMT                      Output format for numbers. The default is "%.6g", which uses %e or %f, whichever gives shorter output.
CONVFMT                   Conversion format. Controls conversion of numbers to strings, the default value being "%.6g".
PROCINFO                  The elements of the PROCINFO array provide information about the awk process running the program (gawk only).
PROCINFO ["pid"]          The process id of the awk process.
PROCINFO ["ppid"]         The process id of the parent of the awk process.
PROCINFO ["sorted_in"]    The value of this element controls the order in which array indexes are processed by the for (index in array) loop.
SUBSEP                    Subscript separator for multidimensional arrays. The default value is "\034".
For example, if we have a bunch of files and we wish to print these files with each line having the line number and file name, we can use a script like the following,
awk '
{
    if (old_filename != FILENAME)
        if (NR != 1) {
            NR = 1          # restart numbering for the new file
            printf ("\n")
        }
    printf "%4d %s: %s\n", NR, FILENAME, $0
    old_filename = FILENAME
} ' "$@"
The built-in variable FNR holds the record number within the current file, so the above script can be written more simply (without the blank line between files) as,
awk '
{
    printf "%4d %s: %s\n", FNR, FILENAME, $0
} ' "$@"
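To see how NR and FNR differ, here is a minimal sketch that runs awk over two small temporary files. The directory and the file names one.txt and two.txt are created only for this example.

```shell
# Create two throwaway input files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n'    > "$dir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
out=$(cd "$dir" && awk '{ print NR, FNR, FILENAME, $0 }' one.txt two.txt)
printf '%s\n' "$out"
# 1 1 one.txt a
# 2 2 one.txt b
# 3 1 two.txt c

rm -r "$dir"
```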

5.0 Special Patterns: BEGIN and END

awk has two special patterns, BEGIN and END. The action corresponding to BEGIN is executed at the start of the script, before the first line of input is processed. BEGIN is useful for setting things up, such as initializing variables and printing headings. It is not necessary to initialize variables to zero, as awk does that by default. The action related to the END pattern is executed at the end of the script, after the last line of input has been processed. It is useful for printing summaries, grand totals, etc. For example, the following script prints all the user ids and associated shells, and also a count of all user ids, in /etc/passwd.
awk '
BEGIN {
    FS = ":"
    printf ("\nUser id\t\tShell\n\n")
}
{
    printf ("%s\t\t%s\n", $1, $7)
}
END {
    printf "\nTotal number of user ids = %d\n", NR
} ' /etc/passwd
As another example, suppose we have a comma-separated values (CSV) file, expenditure.csv, with each line giving certain expenditures for three years,
$ cat expenditure.csv
Expenditure,2012,2013,2014
Advertising,200015,233912,189928
Bank charges,23029,26667,34990
Boarding and Lodging,237899,453326,356625
Communication,34556,28928,34222
Conveyance,23444,27889,43882
Insurance,43344,41992,45667
Repairs and maintenance,26609,19887,29008
Servers,34556,35662,32118
Stationery,38828,28779,39887
Sub-contracting,29988,30992,21334
Travel,891112,788277,655489
We can process it in awk and generate total expenditure for each year.
$ awk ' BEGIN { FS = "," }
> NR == 1 { printf ("%s,%s,%s,%s\n", $1, $2, $3, $4) }
> NR > 1 { sum2 += $2
>          sum3 += $3
>          sum4 += $4
>          printf ("%s,%6d,%6d,%6d\n", $1, $2, $3, $4) }
> END { printf ("Totals,%6d,%6d,%6d\n", sum2, sum3, sum4) }
> ' expenditure.csv
Expenditure,2012,2013,2014
Advertising,200015,233912,189928
Bank charges, 23029, 26667, 34990
Boarding and Lodging,237899,453326,356625
Communication, 34556, 28928, 34222
Conveyance, 23444, 27889, 43882
Insurance, 43344, 41992, 45667
Repairs and maintenance, 26609, 19887, 29008
Servers, 34556, 35662, 32118
Stationery, 38828, 28779, 39887
Sub-contracting, 29988, 30992, 21334
Travel,891112,788277,655489
Totals,1583380,1716311,1483150
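The same BEGIN, main rule, END structure can compute an average instead of a total. The sketch below uses two rows taken from the 2012 column above, fed inline; the variable names total and out are chosen for this example.

```shell
out=$(printf 'Conveyance,23444\nInsurance,43344\n' | awk '
BEGIN { FS = "," }              # set up before any input is read
      { total += $2 }           # runs once per input line
END   { if (NR > 0)             # NR is the number of lines read
            printf("%d items, average %.1f\n", NR, total / NR) }')
printf '%s\n' "$out"            # 2 items, average 33394.0
```

The NR > 0 check guards against division by zero when the input is empty.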

6.0 Operators

The operators are given below in decreasing order of precedence.

awk Operators (decreasing order of precedence)

Operators                 Description
++ --                     Increment and decrement (prefix and postfix).
^                         Power.
!                         Negate expression (unary).
* / %                     Multiply, divide and modulus (remainder).
+ -                       Add and subtract.
nothing                   String concatenation.
> >= < <= == !=           Relational operators.
~ !~                      Regular expression match and non-match.
&&                        Logical AND.
||                        Logical OR.
= += -= *= /= %= ^=       Assignment.

There is no explicit string concatenation operator. If there are two adjacent strings in an expression, the strings are concatenated. The expression x ^ y computes x to the power y. The operators ~ and !~ match the string on the left against the regular expression on the right; ~ is true when there is a match and !~ is true when there is none. The logical AND operator, &&, computes expr1 && expr2, which is true if and only if both expr1 and expr2 are true; expr2 is not evaluated if expr1 evaluates to false. Similarly, the logical OR operator, ||, computes expr1 || expr2, which is true if either expr1 or expr2 is true; expr2 is not evaluated if expr1 evaluates to true. An assignment of the form y op= expr is equivalent to y = y op expr.
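A few of these rules are easy to check in a BEGIN block, which needs no input. This is a small sketch; the variable names (x, s, m, hit, state) are arbitrary.

```shell
out=$(awk 'BEGIN {
    x = 2 ^ 3 * 2            # ^ binds tighter than *: (2 ^ 3) * 2
    s = "a" 1 + 2            # + binds tighter than concatenation: "a" (1 + 2)
    m = ("hello" ~ /ell/)    # a successful match yields 1
    r = (0 && (hit = 1))     # right-hand side is skipped, so hit is never assigned
    if (hit == "")
        state = "unset"
    else
        state = "set"
    print x, s, m, state
}')
printf '%s\n' "$out"         # 16 a3 1 unset
```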

7.0 Conversions

awk converts numbers into strings and vice-versa based on the context. For example,
$ cat conv
awk '
{
    a = "That is gr"
    b = 8
    c = "9"
    d = 4.12345678
    e = a b      # concatenate string and number
    f = b + c    # add number and string
    g = d " "    # concatenate number and string
    printf ("e = %s\n", e)
    printf ("f = %d\n", f)
    printf ("g = %s\n", g)
    CONVFMT = "%2.3f"
    h = b " "    # convert integer to string
    printf ("d = %s\n", d)
    printf ("h = %s\n", h)
} '
$ ./conv
e = That is gr8
f = 17
g = 4.12346
d = 4.123
h = 8
In the above example, the input comes from the standard input, so you need to enter a line of input on the keyboard to get the above output. The string g is obtained by concatenating the number d with a string containing a space. When a number is converted to a string, the built-in variable CONVFMT is used; its default value, "%.6g", gives a maximum of six significant digits. If we change CONVFMT to "%2.3f", the conversion of the floating-point variable d changes accordingly. However, CONVFMT is not used for converting integers to strings, so h holds the string 8 followed by a space, and not 8.000.
You can convert a number to a string by concatenating it with a string, such as a space or, to avoid the extra character, the empty string "". Similarly, you can convert a string to a number by adding zero to it.
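The two conversion idioms can be sketched as follows. The sample values and variable names are made up; the point is what adding zero and concatenating the empty string do.

```shell
out=$(awk 'BEGIN {
    n = "3.5kg" + 0          # the leading numeric part of the string is used
    z = "kg" + 0             # no leading number, so the value is 0
    s = 17 ""                # integer to string, without a trailing space
    CONVFMT = "%.2f"
    t = 3.14159 ""           # non-integer to string, formatted by CONVFMT
    print n, z, length(s), t
}')
printf '%s\n' "$out"         # 3.5 0 2 3.14
```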

8.0 Arrays

awk provides arrays. For example,
$ cat tarray
awk '
{
    for (i = 0; i < 4; i++)
        p [i] = 5 + i * 2
    for (j in p)
        printf ("%d ", p[j])
    printf ("\n")
} '
$ ./tarray
5 7 9 11
The arrays in awk are quite different from those in the C language. There are no array declarations, so the array size is not fixed in advance. Most significantly, the array index is a string. An array is a collection of name-value pairs. Such arrays are known as associative arrays, as they associate values with names. In the above example, we used integers as the array index. awk converts them based on the context and stores values for p ["0"], p ["1"], p ["2"], etc. The array index can be any valid string. For example,
$ cat tarray1
awk '
{
    fruit ["apple"] = 4
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    for (j in fruit)
        printf ("%s: %d numbers\n", j, fruit [j])
} '
$ ./tarray1
guava: 8 numbers
mango: 12 numbers
apple: 4 numbers
banana: 16 numbers
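A common use of string-indexed arrays is counting how often each word occurs. The sketch below is one way to do it; the input text and the names count and out are assumptions for this example. Because for (w in count) visits indexes in no particular order, the output is piped through sort to make it predictable.

```shell
out=$(printf 'apple mango apple\nguava apple mango\n' | awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }      # one counter per distinct word
END { for (w in count) print w, count[w] }' | sort)
printf '%s\n' "$out"
# apple 3
# guava 1
# mango 2
```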
In gawk, the order in which the indexes are processed depends on the built-in array element PROCINFO ["sorted_in"]. The values of PROCINFO ["sorted_in"] can be,

PROCINFO ["sorted_in"] values

PROCINFO ["sorted_in"]    Description
@unsorted                 Array indexes are processed in arbitrary order (default awk behavior).
@ind_str_asc              The array is sorted with indexes compared as strings in ascending order.
@ind_num_asc              The array is sorted with indexes compared as numbers in ascending order. Non-numeric indexes are treated as zero.
@val_type_asc             The array is sorted based on values as per their type in ascending order. All numbers come before the strings. The subarrays come after the strings.
@val_str_asc              The array is sorted based on values of elements, treating the values as strings, in ascending order.
@val_num_asc              The array is sorted based on values of elements, treating values as numbers, in ascending order.
@ind_str_desc             The array is sorted based on index, treated as strings, in descending order.
@ind_num_desc             The array is sorted based on index, treated as numbers, in descending order.
@val_type_desc            The array is sorted based on the value of the element as per its type in descending order. Subarrays come first, then the strings and lastly, the numbers.
@val_str_desc             The array is sorted based on element values, treated as strings, in descending order.
@val_num_desc             The array is sorted based on values, treated as numbers, in descending order.
We can sort the array in the earlier example based on index, treated as strings, in ascending order.
$ cat tarray1
awk '
BEGIN {PROCINFO ["sorted_in"] = "@ind_str_asc" }
{
    fruit ["apple"] = 4
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    for (j in fruit)
        printf ("%s: %d numbers\n", j, fruit [j])
} '
$ ./tarray1
apple: 4 numbers
banana: 16 numbers
guava: 8 numbers
mango: 12 numbers
And sorting on element values, treating the values as numbers, in descending order,
$ cat tarray1
awk '
BEGIN {PROCINFO ["sorted_in"] = "@val_num_desc" }
{
    fruit ["apple"] = 4
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    for (j in fruit)
        printf ("%s: %d numbers\n", j, fruit [j])
} '
$ ./tarray1
banana: 16 numbers
mango: 12 numbers
guava: 8 numbers
apple: 4 numbers
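PROCINFO ["sorted_in"] is specific to gawk. Where another awk is in use, one possible alternative is to let the sort command do the ordering. This is a sketch of that different technique, not something the scripts above do; it sorts the same fruit counts by value in descending numeric order.

```shell
out=$(awk 'BEGIN {
    fruit["apple"] = 4;  fruit["mango"] = 12
    fruit["guava"] = 8;  fruit["banana"] = 16
    for (j in fruit)                 # arbitrary order here
        printf("%s %d\n", j, fruit[j])
}' | sort -k2,2nr)                   # order by the second field, numeric, descending
printf '%s\n' "$out"
# banana 16
# mango 12
# guava 8
# apple 4
```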

8.1 Basic Array Operations

We have seen how to add elements to arrays. There are two other basic operations. First, we can check whether an index is present in an array using the in operator. Second, we can delete an element or the entire array using the delete statement. For example,
$ cat tarray1
awk '
{
    fruit ["apple"] = 4
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    if ("apple" in fruit)
        print "We have apple."
    else
        print "We do not have apple."
    delete fruit ["apple"]
    print "fruit [\"apple\"] has been deleted."
    if ("apple" in fruit)
        print "We have apple."
    else
        print "We do not have apple."
} '
$ ./tarray1
We have apple.
fruit ["apple"] has been deleted.
We do not have apple.
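One reason to prefer the in operator for membership tests is that merely referring to an element creates it. The following sketch shows the difference; the index "kiwi" and the counter n are made up for the example.

```shell
out=$(awk 'BEGIN {
    fruit["apple"] = 4
    if (!("kiwi" in fruit)) print "kiwi absent"      # in does not create the element
    if (fruit["kiwi"] == "") print "kiwi empty"      # a plain reference creates it
    if ("kiwi" in fruit)     print "kiwi now present"
    delete fruit["kiwi"]                             # remove the accidental element
    n = 0
    for (k in fruit) n++                             # count the remaining elements
    print "elements:", n
}')
printf '%s\n' "$out"
# kiwi absent
# kiwi empty
# kiwi now present
# elements: 1
```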

8.2 Multidimensional arrays

awk provides multidimensional arrays like most programming languages. The array indexes are written inside square brackets, like arr [i, j, k]. awk concatenates the indexes, with the value of the built-in subscript separator variable SUBSEP in between, to make a single index for internal storage. For example, if SUBSEP has a value of @, for an element fruit [25, "apple", 7], awk stores fruit ["25@apple@7"] internally. The default value for SUBSEP is \034. For example,
$ cat tarray3
awk '
{
    arr [0,0] = 45
    arr [2, 3] = "hello"
    arr [1, 1] = "hi"
    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++)
            if ((i, j) in arr)
                printf ("arr [%d, %d] = %s\n", i, j, arr [i, j])
} '
$ ./tarray3
arr [0, 0] = 45
arr [1, 1] = hi
arr [2, 3] = hello
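Since a multidimensional index is stored as one string, a for (k in arr) loop hands back the combined index. A minimal sketch of recovering the parts with split () and SUBSEP follows; the array name sales and its indexes are hypothetical.

```shell
out=$(awk 'BEGIN {
    sales["north", "q1"] = 10
    for (k in sales) {
        split(k, part, SUBSEP)       # part[1] = "north", part[2] = "q1"
        print part[1], part[2], sales[k]
    }
}')
printf '%s\n' "$out"                 # north q1 10
```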
We can delete an entire array, but even after that we cannot reuse the same variable name for a scalar variable. For example,
$ cat tarray3
awk '
{   arr [0] = 45
    delete arr
    arr = 3
} '
$ ./tarray3
awk: cmd. line:4: (FILENAME=- FNR=1) fatal: attempt to use array `arr' in a scalar context

8.3 Array of Arrays

The multidimensional arrays described above are part of standard awk. However, gawk provides true arrays of arrays, wherein an array element can itself be an array. The array elements which are also arrays are called subarrays. Arrays need not be rectangular; different rows may have different numbers of elements. The isarray () function can be used to check whether an element is an array.
$ cat tarray2
awk '
BEGIN {PROCINFO ["sorted_in"] = "@ind_str_asc" }
{
    fruit ["apple"]["study"] = 4
    fruit ["apple"]["kitchen"] = 5
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    for (j in fruit)
        if (isarray (fruit [j])) {
            for (k in fruit [j])
                printf ("%s in %s = %d numbers\n", j, k, fruit [j][k])
        }
        else
            printf ("%s: %d numbers\n", j, fruit [j])
} '
$ ./tarray2
apple in kitchen = 5 numbers
apple in study = 4 numbers
banana: 16 numbers
guava: 8 numbers
mango: 12 numbers
There should not be any space between brackets containing indexes.
$ cat tarray2
awk '
{
    fruit ["apple"] ["study"] = 4
    ...
$ ./tarray2
awk: cmd. line:3: fruit ["apple"] ["study"] = 4
awk: cmd. line:3:                 ^ syntax error
We can delete an element which is a subarray and replace it with a scalar. But we cannot delete the main array and replace it with a scalar.
$ cat tarray4
awk '
{
    fruit ["apple"]["study"] = 4
    fruit ["apple"]["kitchen"] = 5
    fruit ["mango"] = 12
    fruit ["guava"] = 8
    fruit ["banana"] = 16
    delete fruit ["apple"]
    fruit ["apple"] = 7
    printf ("Now we have %d apples.\n", fruit ["apple"])
    delete fruit ["apple"]
} '
$ ./tarray4
Now we have 7 apples.

9.0 Quoting

We have been using single quotes around the awk program and double quotes for strings inside the program. But what about the case when you want to print a single quote as a part of a string inside the program? Suppose you want to print the string, That's great!.
$ awk ' { a = "That'"'"'s great!"
          printf ("a = %s\n", a)
        } '
a = That's great!
How does it work? Well, it's all about the shell and not about awk. If we put multiple strings together in shell, without any space in between, the shell concatenates these into a single string. In the above script, there are three strings, put next to one another. The first one starts with a single quote before the left curly brace and ends with a single quote after That. The next string is between two double quotes and contains a single quote. The third string starts right after the second string. It starts with a single quote and ends with a single quote after the right curly brace. The shell concatenates the three into a single argument and passes it to awk.
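Two other ways of getting a single quote into the output avoid the shell quoting juggling altogether. Both are sketches of alternatives rather than the method used above: the first uses the octal escape \047 (the ASCII code of the single quote) inside an awk string, and the second passes the quote in with -v. The variable name q is chosen for this example.

```shell
# Octal escape: \047 is the single quote character in ASCII.
a=$(awk 'BEGIN { printf("That\047s great!\n") }')
printf '%s\n' "$a"           # That's great!

# Pass the quote character in as an awk variable.
b=$(awk -v q="'" 'BEGIN { printf("That%ss great!\n", q) }')
printf '%s\n' "$b"           # That's great!
```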

10.0 awk Built-in Functions

The built-in functions help in developing the awk program. Some of the built-in functions are,
awk Built-in Functions (subset)

Function                                          Description
atan2 (y, x)                                      Arctangent of y/x in radians.
cos (x)                                           Cosine of x, where x is in radians.
exp (x)                                           Exponential of x, e^x.
gsub (regexp, replacement [, target])             Search target or $0, if target is not given. Match regexp, and replace with replacement for all occurrences.
index (s1, s2)                                    Position of string s2 in s1. String indexes start from 1. Returns 0 if not found.
int (expr)                                        Integer part of expr with the decimal part getting truncated.
length ([string])                                 Length of string, or that of $0, if string is not given.
log (x)                                           Natural logarithm of x.
match (string, regexp [, array])                  Search the string for the longest leftmost match of regexp. Returns the index of the match in string, or 0 if there is no match. If array is present (a gawk extension), the matched part of string is stored in array [0], and parenthesized substring matches in array [1], array [2], etc.
rand ()                                           Returns a random number r, where 0 <= r < 1.
sin (x)                                           Sine of x, where x is in radians.
split (string, array [, fieldsep [, seps ] ])     Split string using fieldsep, storing pieces in array [1], array [2], ... and storing the strings between pieces in seps (a gawk extension). Returns the number of pieces. If fieldsep is not given, FS is used.
sprintf (format, expression1, ...)                Returns the string formatting expression1, ..., as per the format.
sqrt (x)                                          Square root of x.
srand ([x])                                       Seed for the random number function. If x is not given, the current system time is used.
sub (regexp, replacement [, target])              Search target or $0, if target is not given. Match regexp, and replace with replacement for the first occurrence.
substr (s, m [, n])                               Returns the substring of string s, starting at index m and of length n. If n is not given, the substring starts at index m in s and contains the rest of s.
tolower (string)                                  Returns a copy of string, replacing each uppercase character with its lowercase counterpart.
toupper (string)                                  Returns a copy of string, replacing each lowercase character with its uppercase counterpart.
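A few of the string functions can be exercised together in a BEGIN block. This is a small sketch with a made-up sample string; sub () and gsub () modify their target in place, so each works on its own copy.

```shell
out=$(awk 'BEGIN {
    s = "awk is a filter"
    print length(s)                    # 15
    print index(s, "filter")           # 10
    print toupper(substr(s, 1, 3))     # AWK
    n = split("a:b:c", part, ":")
    print n, part[3]                   # 3 c
    t = s; gsub(/i/, "I", t)           # replace every i
    print t                            # awk Is a fIlter
    u = s; sub(/i/, "I", u)            # replace only the first i
    print u                            # awk Is a filter
}')
printf '%s\n' "$out"
```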

11.0 awk 1

At the beginning of this post, we saw that awk syntax is,
awk 'program' filenames ...
where program is
pattern { action }
pattern { action }
...
It turns out that the pattern 1, or for that matter any non-zero number, evaluates to true. When the action is omitted, the default action is to print $0. So,
$ awk '1' file
prints the file.
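The same idea works with any expression used as a pattern. As a brief sketch on inline data (the variable names all, none and nonblank are chosen for this example): 1 prints every line, 0 prints none, and NF alone prints only lines that have at least one field.

```shell
all=$(printf 'one\ntwo\n' | awk '1')            # pattern is always true
none=$(printf 'one\ntwo\n' | awk '0')           # pattern is always false
nonblank=$(printf 'one\n\ntwo\n' | awk 'NF')    # NF is 0 on the blank line
printf '%s\n' "$nonblank"
# one
# two
```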
