Wednesday 31 July 2019

One line programs in awk

Awk can do very useful things with as little as one line of code; few other programming languages can do so much with so little. In this article, I show some examples of these one-liners.

To emulate the Unix/Linux word count utility (wc)

awk '{ C += length($0) +1; W += NF } END {print NR, W, C}'
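
As a quick check, the one-liner can be wrapped in a shell function and fed a small known input. The function name awk_wc is made up for this sketch. Note that length() counts characters in some awk implementations, so in a multibyte locale the last number may differ from the byte count reported by wc -c.

```shell
# awk_wc: hypothetical wrapper around the one-liner; reads standard input.
awk_wc() {
    awk '{ C += length($0) + 1; W += NF } END { print NR, W, C }'
}

# Two lines, three words, 16 characters (newlines included).
printf 'hello world\nfoo\n' | awk_wc    # prints: 2 3 16
```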

To print original data values and their logarithms for one-column data files

awk '{print $1, log($1) }' file(s)

To print a random sample of about 5 percent of the lines from a text file

awk 'rand() < 0.05' file(s)
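
In many awk implementations, rand() produces the same sequence on every run unless srand() is called first, so the sample can come out identical each time. Here is a minimal sketch that seeds the generator from the clock; the function name sample_lines and the variable P are made up for this example.

```shell
# sample_lines FRACTION: hypothetical helper; reads standard input and
# keeps each line with probability FRACTION.
sample_lines() {
    awk -v P="$1" 'BEGIN { srand() } rand() < P'
}

# Keep roughly 5 percent of 1000 generated lines; the count varies per run.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' | sample_lines 0.05 | wc -l
```

Because srand() with no argument seeds from the time of day, two runs started within the same second can still select the same lines.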

Reporting the sum of the nth column in tables with whitespace separated columns

awk -v COLUMN=n '{ sum += $COLUMN } END { print sum }' file(s)

Report the average of column n

awk -v COLUMN=n '{ sum += $COLUMN } END { print sum / NR }' file(s)
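
Dividing by NR counts every input line, including blank lines and lines that lack column n, and on empty input it divides by zero. Here is a hedged variant that averages only the lines that actually have the column; col_avg and the counter n are names invented for this sketch.

```shell
# col_avg N: hypothetical helper; averages column N of standard input,
# skipping lines with fewer than N fields.
col_avg() {
    awk -v COLUMN="$1" '
        NF >= COLUMN + 0 { sum += $COLUMN; n++ }   # count only usable lines
        END { if (n > 0) print sum / n }            # print nothing if no data
    '
}

# The blank line is ignored, so the result is (10 + 30) / 2.
printf '1 10\n\n3 30\n' | col_avg 2    # prints: 20
```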

To print a running total of the amounts in the last field (the number of columns may vary)

awk '{ sum += $NF ; print $0, sum}' file(s)

Some simple ways to search for text in files

egrep 'pattern|pattern' file(s)
awk '/pattern|pattern/' file(s)
awk '/pattern|pattern/ {print FILENAME ":" FNR ":" $0 }' file(s)

Search range of lines

Search lines 100 through 150 for the pattern
awk '(100 <= FNR) && (FNR <= 150) && /pattern/ { print FILENAME ":" FNR ":" $0 }' file(s)
An alternative using sed and egrep in the shell
sed -n -e 100,150p -s file(s) | egrep 'pattern'

To swap the second and third columns in a four-column table, assuming tab separators, use either of these

awk -F'\t' -v OFS='\t' '{print $1,$3,$2,$4}' old >new
awk 'BEGIN { FS = OFS ="\t" } {print $1,$3,$2,$4}' old >new

To convert column separators from tab to ampersand

sed -e 's/\t/\&/g' file(s)
awk 'BEGIN { FS = "\t"; OFS = "&" } { $1 = $1; print }' file(s)
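
The assignment $1 = $1 looks like a no-op, but modifying any field makes awk rebuild $0 with the fields joined by OFS; without it, print would emit the line unchanged. Here is a small sketch with made-up sample data; tab_to_amp is a hypothetical name.

```shell
# tab_to_amp: hypothetical helper; reads tab-separated lines from standard
# input and writes them with "&" between the fields.
tab_to_amp() {
    awk 'BEGIN { FS = "\t"; OFS = "&" } { $1 = $1; print }'
}

printf 'a\tb\tc\n' | tab_to_amp    # prints: a&b&c
```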

To eliminate duplicate lines from a sorted stream

sort file(s) | uniq
sort file(s) | awk 'Last != $0 { print } { Last = $0 }'
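
One caveat with the awk version: Last starts out empty, so if the first sorted line is itself empty, it compares equal and is dropped. Here is a sketch that guards the first record; dedupe_sorted is a name invented here.

```shell
# dedupe_sorted: hypothetical helper; removes adjacent duplicate lines from
# standard input, always keeping the first record even when it is empty.
dedupe_sorted() {
    awk 'NR == 1 || Last != $0 { print } { Last = $0 }'
}

# The input holds two empty lines, two "a" lines, and one "b" line.
printf 'b\na\n\na\n\n' | sort | dedupe_sorted
```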

To convert carriage return/newline line terminators to newline terminators, use one of these

sed -e 's/\r$//' file(s)
sed -e 's/^M$//' file(s)
mawk 'BEGIN { RS = "\r\n" } { print }' file(s)
Note:
The first sed example needs a modern version that recognizes escape sequences.
In the second example, ^M represents a literal Ctrl-M (carriage return) character.
For the third example, we need either gawk or mawk because nawk and POSIX awk do not support more than a single character in RS.
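
Where only a POSIX awk is available, a portable alternative is to leave RS alone and strip a trailing carriage return from each record with sub(). This is a sketch rather than a drop-in replacement for the commands above; crlf_to_lf is a hypothetical name.

```shell
# crlf_to_lf: hypothetical helper; removes one trailing carriage return
# from every line read on standard input.
crlf_to_lf() {
    awk '{ sub(/\r$/, ""); print }'
}

# od -c makes any remaining carriage returns visible.
printf 'one\r\ntwo\r\n' | crlf_to_lf | od -c
```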

To convert single spaced text lines to double spaced lines, use any of these

sed -e 's/$/\n/' file(s)
awk 'BEGIN { ORS ="\n\n" } { print }' file(s)
awk 'BEGIN { ORS = "\n\n" } 1' file(s)
awk '{print $0 "\n" }' file(s)
awk '{print; print ""}' file(s)

Conversion of double spaced lines to single spacing is equally easy

gawk 'BEGIN { RS="\n *\n" } { print }' file(s)

To strip angle bracketed markup tags from HTML documents, treat the tags as record separators, like this:

mawk 'BEGIN { ORS = " "; RS = "<[^<>]*>" } { print }' *.html
By setting ORS to a space, HTML markup gets converted to a space, and all input line breaks are preserved.

To extract all of the titles from a collection of XML documents

mawk -v ORS=' ' -v RS='[ \n]' '/<title *>/, /<\/title *>/' *.xml | sed -e 's@</title *> *@&\n@g'
The example above extracts the titles from XML documents and prints them one title per line, with the surrounding markup. It works correctly even when the titles span multiple lines, and it handles the uncommon but legal case of spaces between the tag word and the closing angle bracket.
