Awk can do very useful things in as little as one line of code; few other programming languages can do so much with so little. In this article, I show some examples of these one-liners.
To emulate the Unix/Linux word count utility, wc (reading standard input)
awk '{ C += length($0) +1; W += NF } END {print NR, W, C}'
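As a quick sanity check, the emulation can be compared with the real wc on a small input. The sketch below assumes a made-up two-line ASCII sample; the function and variable names are only illustrative. For non-ASCII text the character count may differ from wc, since length() may count characters rather than bytes.

```shell
# Hypothetical sample input: two lines, three words, sixteen bytes.
make_sample() { printf 'hello world\nfoo\n'; }

# The awk emulation: lines, words, characters (+1 per line for its newline).
awk_counts=$(make_sample | awk '{ C += length($0) + 1; W += NF } END { print NR, W, C }')

# The real utility, with its column padding squeezed out for comparison.
wc_counts=$(make_sample | wc | awk '{ print $1, $2, $3 }')

echo "awk: $awk_counts"
echo "wc:  $wc_counts"
```

Both lines should report the same three numbers for this sample.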
To print the original data values and their logarithms for one-column data files
awk '{print $1, log($1) }' file(s)
To print a random sample of about 5 percent of the lines from a text file
awk 'rand() < 0.05' file(s)
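One caveat is worth a sketch: in some awk implementations (gawk is one), rand() returns the same sequence on every run unless srand() is called first, so the sample can come out identical each time. The variations below seed the generator, either from the clock or from an explicit value for repeatable samples. The SEED variable and the gen_lines helper are assumptions made for illustration.

```shell
# Generate 1000 numbered lines as stand-in input (invented for this example).
gen_lines() { awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }'; }

# Seed from the clock: runs started in different seconds pick different lines.
varying=$(gen_lines | awk 'BEGIN { srand() } rand() < 0.05')

# Seed explicitly: the same SEED reproduces the same sample on the same awk.
first=$(gen_lines | awk -v SEED=42 'BEGIN { srand(SEED) } rand() < 0.05')
second=$(gen_lines | awk -v SEED=42 'BEGIN { srand(SEED) } rand() < 0.05')

# Show how many lines the seeded sample selected (roughly 5 percent of 1000).
printf '%s\n' "$first" | wc -l
```

The exact lines chosen depend on the awk implementation, so only the repeatability for a given seed should be relied on.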
To report the sum of the nth column in tables with whitespace-separated columns
awk -v COLUMN=n '{ sum += $COLUMN } END { print sum }' file(s)
To report the average of column n
awk -v COLUMN=n '{ sum += $COLUMN } END { print sum / NR }' file(s)
To print a running total of the amounts in the last field (the number of columns may vary)
awk '{ sum += $NF ; print $0, sum}' file(s)
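To make the running total concrete, here is a small worked sketch. The three-line price list is invented for illustration and the names are arbitrary; the second command moves the print into an END block to show only the final total.

```shell
# Hypothetical input: the amount is always last, but the column count varies.
price_list() { printf 'apples 3 1.50\npears 2.25\nkiwis 12 green 0.75\n'; }

# Running total: each input line is echoed with the sum so far appended.
running=$(price_list | awk '{ sum += $NF; print $0, sum }')

# Final total only: accumulate silently, print once at the end.
total=$(price_list | awk '{ sum += $NF } END { print sum }')

printf '%s\n' "$running"
echo "total: $total"
```

Because $NF always refers to the last field, the same command works whether a line has two columns or four.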
Some simple ways to search for text in files
egrep 'pattern|pattern' file(s)
awk '/pattern|pattern/' file(s)
awk '/pattern|pattern/ { print FILENAME ":" FNR ":" $0 }' file(s)
Searching a range of lines
To search only lines 100 through 150 of each file for the pattern
awk '(100 <= FNR) && (FNR <= 150) && /pattern/ { print FILENAME ":" FNR ":" $0 }' file(s)
An alternative using sed and egrep (the -s option, which treats each file separately, requires GNU sed)
sed -n -e 100,150p -s file(s) | egrep 'pattern'
To swap the second and third columns in a four-column table, assuming tab separators, use either of the following
awk -F'\t' -v OFS='\t' '{ print $1, $3, $2, $4 }' old >new
awk 'BEGIN { FS = OFS = "\t" } { print $1, $3, $2, $4 }' old >new
To convert column separators from tab to ampersand (in the sed command, tab stands for a literal tab character)
sed -e 's/tab/\&/g' file(s)
awk 'BEGIN { FS = "\t"; OFS = "&" } { $1 = $1; print }' file(s)
To eliminate duplicate lines from a sorted stream
sort file(s) | uniq
sort file(s) | awk 'Last != $0 { print } { Last = $0 }'
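When the input cannot be sorted first, a commonly used awk idiom removes duplicates while keeping the first occurrence of each line in its original order. This is a different technique from the Last comparison above: it stores every distinct line in an array (called seen here, an arbitrary name), so memory use grows with the number of unique lines. A minimal sketch on invented input:

```shell
# Hypothetical unsorted input containing repeated lines.
unsorted() { printf 'b\na\nb\nc\na\n'; }

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only on first sight; a true pattern with no action prints the line.
unique=$(unsorted | awk '!seen[$0]++')

printf '%s\n' "$unique"
```

For this input the command prints b, a, and c, in that order.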
To convert carriage return/newline line terminators to newline terminators, use one of the following
sed -e 's/\r$//' file(s)
sed -e 's/^M$//' file(s)
mawk 'BEGIN { RS = "\r\n" } { print }' file(s)
Note:
The first sed example needs a modern version that recognizes escape sequences.
In the second example, ^M represents a literal Ctrl-M (carriage return) character.
For the third example, we need either gawk or mawk because nawk and POSIX awk do not support more than a single character in RS.
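Where neither gawk nor mawk is available, a variation that leaves RS alone and strips a trailing carriage return from each record with sub() should work in any POSIX awk. This is a swapped-in technique, not the RS approach described above, and the sample input is invented for illustration.

```shell
# Hypothetical two-line input with CR/LF line terminators.
crlf_sample() { printf 'one\r\ntwo\r\n'; }

# Remove one carriage return at the end of each line, then print the line.
fixed=$(crlf_sample | awk '{ sub(/\r$/, ""); print }')

# Count carriage returns before and after as a sanity check.
before=$(crlf_sample | tr -cd '\r' | wc -c | tr -d ' ')
after=$(crlf_sample | awk '{ sub(/\r$/, ""); print }' | tr -cd '\r' | wc -c | tr -d ' ')

echo "carriage returns before: $before, after: $after"
```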
To convert single-spaced text lines to double-spaced lines, use any of these (the sed version needs a sed that recognizes \n in the replacement, such as GNU sed)
sed -e 's/$/\n/' file(s)
awk 'BEGIN { ORS = "\n\n" } { print }' file(s)
awk 'BEGIN { ORS = "\n\n" } 1' file(s)
awk '{ print $0 "\n" }' file(s)
awk '{ print; print "" }' file(s)
Conversion of double spaced lines to single spacing is equally easy
gawk 'BEGIN { RS = "\n *\n" } { print }' file(s)
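The one-liner above relies on a regular expression in RS, which not every awk supports. As a rough portable sketch, the pattern NF (true when a line has at least one field) drops empty and whitespace-only lines. Note the difference: it removes every blank line, including any that were blank before double spacing. The round trip below uses invented input.

```shell
# Hypothetical single-spaced input.
single_spaced() { printf 'a\nb\nc\n'; }

# Double-space it, then undo the double spacing by keeping non-blank lines only.
round_trip=$(single_spaced | awk '{ print; print "" }' | awk 'NF')

printf '%s\n' "$round_trip"
```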
To strip angle bracketed markup tags from HTML documents, treat the tags as record separators, like this:
mawk 'BEGIN { ORS = " "; RS = "<[^<>]*>" } { print }' *.html
Setting ORS to a space converts each HTML tag to a space, and all input line breaks are preserved.
To extract all of the titles from a collection of XML documents
mawk -v ORS=' ' -v RS='[ \n]' '/<title *>/, /<\/title *>/' *.xml | sed -e 's@</title *> *@&\n@g'
The example above extracts the titles from XML documents and prints them one title per line, with the surrounding markup. It works correctly even when a title spans multiple lines, and it handles the uncommon but legal case of spaces between the tag word and the closing angle bracket.