Thursday 1 August 2019

Linux: Text Processing: grep, cat, awk, uniq

This page is a basic tutorial on the Linux shell's text processing tools. They are especially useful for processing text line by line.

Get Lines: grep
grep is the most important command. You should master it.

Show Matching Lines
# show lines containing xyz in myFile
grep 'xyz' myFile
# show lines containing xyz in all files ending in html, top level of current dir only
grep 'xyz' *html
Grep for All Files in a Dir
# show matching lines in dir and subdir, file name ending in html
grep -r 'xyz' --include='*html' ~/web
Here's what the options mean:

-r → search recursively, including all subdirectories.
--include='*html' → search only files whose name matches the glob pattern (* is a wildcard that matches zero or more of any character).
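
To see the two options working together, here is a minimal sketch on a throwaway directory. The layout and file names (a.html, sub/b.html, sub/c.txt) are made up for the demonstration.

```shell
# Build a small temporary tree to search (hypothetical file names).
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf 'top xyz\n' > "$demo/a.html"
printf 'nested xyz\n' > "$demo/sub/b.html"
printf 'ignored xyz\n' > "$demo/sub/c.txt"

# -r descends into sub/, --include keeps only file names matching the glob.
# The sort is only there to make the output order predictable.
found=$(grep -r 'xyz' --include='*html' "$demo" | LC_ALL=C sort)
printf '%s\n' "$found"
```

Both html files are reported, each line prefixed with its file name. sub/c.txt also contains xyz but is skipped because its name does not match the glob.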
grep without regex
Use the option -F. (F means “Fixed string”)

# search ruby source files for the literal string .*
grep -F '.*' *rb
This is useful when you want to search for a complicated string in source code, such as *@$.*#+-/\|`.
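
The difference is easy to see on a tiny sample. This is a sketch with a made-up two-line file, counting matches with -c:

```shell
sample=$(mktemp)
printf 'puts "a.*b"\nputs "axxb"\n' > "$sample"

# As a regex, a.*b matches both lines: "." is any character, "*" repeats it.
regex_hits=$(grep -c 'a.*b' "$sample")

# With -F the same text is a literal string, so only the first line matches.
fixed_hits=$(grep -c -F 'a.*b' "$sample")

echo "regex: $regex_hits, fixed: $fixed_hits"   # regex: 2, fixed: 1
```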

If your string is really complicated, you can put it in a file, and use the option --file=my_pattern_filename for the search text. Example:

# search js source code in dir and all subdirs. The patterns are stored, one per line, in a file named myPattern.txt
grep -r --file=myPattern.txt --include='*js' .
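
A minimal sketch of a pattern file in use. The file contents and app.js are invented for the example. Each line of the pattern file is one pattern, and a line of input is printed if any of them matches. -F is added here so that ( and [ are taken literally.

```shell
work=$(mktemp -d)

# One pattern per line (hypothetical patterns).
printf 'foo(\nbar[0]\n' > "$work/myPattern.txt"
printf 'x = foo(1)\ny = bar[0]\nz = baz\n' > "$work/app.js"

# Print every line of app.js that contains at least one of the patterns.
matches=$(grep -F --file="$work/myPattern.txt" "$work/app.js")
printf '%s\n' "$matches"
```

Watch out for blank lines in the pattern file: an empty pattern matches every line.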
Most Useful Grep Options


Options for Pattern String
-F → use fixed string. (no regex)
-P → use Perl's regex syntax. (This is a GNU grep option. Perl's and Python's regex syntax are largely compatible.)
-i → ignore case.
-v → print lines NOT containing the pattern.
Examples:

# print lines not matching a string, for all files ending in “log”
grep -v 'html HTTP' *log
# print lines containing “png HTTP” or “jpg HTTP”
grep -P 'png HTTP|jpg HTTP' *log
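
Here is a sketch of these options on a made-up four-line log. It uses -E (extended regex) instead of -P, on the assumption that your grep may have been built without Perl regex support. For a plain alternation like this one the two give the same result.

```shell
log=$(mktemp)
printf 'GET /a.png HTTP/1.1\nGET /b.jpg HTTP/1.1\nGET /c.html HTTP/1.1\nGET /D.PNG HTTP/1.1\n' > "$log"

# -v: count every line except the html request.
not_html=$(grep -c -v 'html HTTP' "$log")

# Alternation: png or jpg, case-sensitive, so D.PNG is not counted.
images=$(grep -c -E 'png HTTP|jpg HTTP' "$log")

# Adding -i also picks up the upper-case D.PNG line.
images_any_case=$(grep -c -i -E 'png HTTP|jpg HTTP' "$log")

echo "$not_html $images $images_any_case"   # 3 2 3
```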
Options for File Selection
*.html = search all files ending in “.html”, in current dir. (files in subdir are ignored)
grep -r --include='*html' pattern dirname = search files for pattern in dirname including subdirs, but only files ending in “html”.
Output Options
-H = include file name in the result.
-h = do NOT print file name.
-l = print just file name; do NOT print the matched lines.
-L = print just the names of files that contain NO match.
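
A small sketch of the four output options, using two made-up files where only one.txt contains the pattern:

```shell
dir=$(mktemp -d)
printf 'has xyz\n' > "$dir/one.txt"
printf 'nothing here\n' > "$dir/two.txt"
cd "$dir"

with_match=$(grep -l 'xyz' one.txt two.txt)      # one.txt
without_match=$(grep -L 'xyz' one.txt two.txt)   # two.txt
no_name=$(grep -h 'xyz' one.txt two.txt)         # has xyz
named=$(grep -H 'xyz' one.txt)                   # one.txt:has xyz

printf '%s\n' "$with_match" "$without_match" "$no_name" "$named"
```

With more than one file on the command line grep prints the file name by default, so -h is what you add to get rid of it. With a single file it is the other way around, and -H forces the name.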
More Grep Examples
# print lines containing “html HTTP” in a log file, show only the 12th and 7th columns, keep only certain lines, then sort, then condense repetition with a count, then sort by the count.

grep 'html HTTP' apache.log | awk '{print $12 , $7}' | grep -i -P "livejournal|blogspot" | sort | uniq -c | sort -n
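
The last three stages (sort | uniq -c | sort -n) are the counting idiom. Here is a sketch of just that part, with made-up two-column lines standing in for what the awk step would print. The trailing awk is not part of the idiom. It only strips the padding uniq puts in front of each count.

```shell
# Invented "referrer page" pairs (hypothetical data).
pairs=$(mktemp)
printf '%s\n' \
  'blogspot /a.html' \
  'livejournal /b.html' \
  'blogspot /a.html' \
  'blogspot /c.html' \
  'blogspot /a.html' > "$pairs"

# sort groups identical lines, uniq -c collapses each group into
# "count line", sort -n orders by that count, smallest first.
counts=$(sort "$pairs" | uniq -c | sort -n | awk '{print $1, $2, $3}')
printf '%s\n' "$counts"
```

The most frequent pair ends up on the last line. Use sort -n -r instead if you want it on top.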
# print all lines containing links in all html files of a dir, except certain links. Output to xx.txt

grep -r --include='*html' -F 'http://' ~/web | grep -v -P 'google\.com|twitter\.com|reddit\.com|wikipedia\.org' > xx.txt
text columns, awk, sort, unique, sum column …
show only nth column in a text file
# print the 7th column. (columns are separated by whitespace, that is, runs of spaces or tabs, by default.)
cat myFile | awk '{print $7}'
For a delimiter other than whitespace, for example tab only, use the -F option. Quote the \t, otherwise the shell strips the backslash before awk sees it. Example:

# print 12th and 7th column, Tab is the separator
cat myFile | awk -F'\t' '{print $12 , $7}'
An alternative solution is to use the cut utility, but it does not accept a regex as delimiter. So, if your columns are separated by varying numbers of spaces, “cut” cannot do it.
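
A short sketch comparing the two tools on a made-up tab-separated file, plus one line with uneven spacing that only awk's default splitting handles:

```shell
tsv=$(mktemp)
printf 'alice\t30\tparis\nbob\t25\trome\n' > "$tsv"

# awk with tab as the separator: columns can be printed in any order.
by_awk=$(awk -F'\t' '{print $3, $1}' "$tsv")

# cut splits on one fixed character (tab by default) and prints the
# selected fields in file order, joined by that same character.
by_cut=$(cut -f1,3 "$tsv")

# Default awk splitting treats any run of spaces or tabs as one separator.
ragged=$(printf 'a    b  c\n' | awk '{print $2}')

printf '%s\n' "$by_awk" "$by_cut" "$ragged"
```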

remove duplicate lines
sort -u myFile
or

sort myFile | uniq
To prefix each line with its repetition count, use sort myFile | uniq -c
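
The sort is there because uniq only collapses duplicate lines that are next to each other. A minimal sketch with a made-up four-line file, counting the lines that remain:

```shell
names=$(mktemp)
printf 'b\na\nb\na\n' > "$names"

# uniq alone removes nothing here: no two identical lines are adjacent.
uniq_only=$(uniq "$names" | awk 'END {print NR}')

# Sorting first makes the duplicates adjacent, so both forms leave 2 lines.
sorted_uniq=$(sort "$names" | uniq | awk 'END {print NR}')
sort_u=$(sort -u "$names" | awk 'END {print NR}')

echo "$uniq_only $sorted_uniq $sort_u"   # 4 2 2
```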

sum up 2nd column
awk '{sum += $2} END {print sum}' file_name → sum the 2nd column in a file.
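
A quick sketch on made-up data. The second command shows one edge case: on empty input sum is never set, so print sum prints an empty line, and writing sum + 0 forces a numeric 0.

```shell
data=$(mktemp)
printf 'apples 3\npears 4.5\nplums 2\n' > "$data"

# Add up the 2nd column; END runs once after the last line is read.
total=$(awk '{sum += $2} END {print sum}' "$data")

# Empty input: sum + 0 prints 0 instead of an empty line.
empty_total=$(awk '{sum += $2} END {print sum + 0}' /dev/null)

echo "$total $empty_total"   # 9.5 0
```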

show only first few lines of a huge file
head file_name → show the first 10 lines of a file (the default).

head -n 100 file_name → show first 100 lines of a file.

tail file_name → show the last 10 lines of a file (the default).

tail -n 100 file_name → show last 100 lines of a file.
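
A small sketch using a generated file of 1000 numbered lines (seq is assumed to be available, as it is on typical Linux systems). Chaining the two commands gives you a slice from the middle of a file.

```shell
big=$(mktemp)
seq 1 1000 > "$big"

first=$(head -n 3 "$big")    # lines 1 to 3
last=$(tail -n 3 "$big")     # lines 998 to 1000

# Without -n, head prints 10 lines.
default_count=$(head "$big" | awk 'END {print NR}')

# Lines 101 to 105: take the first 105, then the last 5 of those.
middle=$(head -n 105 "$big" | tail -n 5)

printf '%s\n' "$middle"
```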
