Tuesday, 6 August 2019

Linux - Use Extended Regular Expressions with grep command

This tutorial explains in detail how to use extended regular expressions with the grep command. Learn what extended regular expressions are and how they work with grep through a practical example that extracts all links from an HTML file.
Extended regular expressions
A regular expression is a search pattern that the grep command matches against the specified file or the provided text. To let a user write more flexible patterns, grep assigns special meanings to a few characters. These characters are known as Meta characters. Initially, grep assigned the characters ^ $ . [ ] and * as Meta characters. Later, a few more characters were added to this list: ( ) { } ? + and |.
Based on the Meta characters used, regular expressions can be divided into two categories: BRE (Basic Regular Expression) and ERE (Extended Regular Expression).
Basic Regular Expression: - An expression which uses only the original Meta characters.
Extended Regular Expression: - An expression which also uses the Meta characters that were added later.

How to use extended regular expression

An extended regular expression uses the Meta characters which were added later. Since these characters are not defined as Meta characters in the original implementation, grep treats them as regular characters unless we ask it to treat them as Meta characters.
To instruct the grep command to treat the later added characters as Meta characters, the -E option is used. Let's take an example. In the original implementation, the pipe sign (|) is a regular character, while in the extended implementation it is a Meta character.
If we use the pipe sign without the -E option, grep treats it as a regular character. If we use it with the -E option, grep treats it as a Meta character that separates alternative words to search for. Let's search for two users' information in the file /etc/passwd with and without the -E option.
#grep "sanjay|rick" /etc/passwd
#grep -E "sanjay|rick" /etc/passwd
[Figure: how to use extended regular expression]
Without the -E option, grep searched for the pattern as the single literal string sanjay|rick in the file /etc/passwd. With the -E option, it split the pattern into the two words sanjay and rick and searched for each of them.
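The difference can be reproduced without touching the real /etc/passwd. The following is a minimal sketch that assumes a temporary sample file with made-up entries (the user lines and the variable names are only for illustration) and counts the matching lines with the -c option.

```shell
# Hypothetical sample data standing in for /etc/passwd.
sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sanjay:x:1001:1001::/home/sanjay:/bin/bash' \
  'rick:x:1002:1002::/home/rick:/bin/bash' > "$sample"

# Without -E the pipe is a literal character, so no line matches.
# grep exits with a non-zero status when nothing matches, hence "|| true".
bre_count=$(grep -c "sanjay|rick" "$sample" || true)

# With -E the pipe separates two alternatives, so both user lines match.
ere_count=$(grep -Ec "sanjay|rick" "$sample")

echo "matching lines without -E: $bre_count"
echo "matching lines with -E: $ere_count"

rm -f "$sample"
```

With this sample data the first count is 0 and the second is 2.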

grep extended regex (search multiple words)

The pipe sign (|) is used to search for multiple words with the grep command. Connect all the words with pipe signs and surround the whole pattern with quotes, so that the shell does not interpret the pipe signs itself. For example, to search for the words abc, fgh, xyz, mno and jkl, use the search pattern "abc|fgh|xyz|mno|jkl".
#grep -E "abc|fgh|xyz|mno|jkl" dummy-file
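As a rough sketch of the same search, the example below builds a small temporary file (its contents are invented for illustration) and runs the alternation pattern against it. For comparison it also repeats the search with one -e option per word, which is another standard way to give grep several patterns and should select the same lines.

```shell
# Hypothetical contents for dummy-file.
dummy=$(mktemp)
printf '%s\n' \
  'abc one' \
  'nothing here' \
  'two xyz' \
  'jkl and mno' \
  'last line' > "$dummy"

# One extended pattern with five alternatives.
alt_matches=$(grep -E "abc|fgh|xyz|mno|jkl" "$dummy")

# The same search written as five separate patterns.
multi_e_matches=$(grep -e abc -e fgh -e xyz -e mno -e jkl "$dummy")

printf '%s\n' "$alt_matches"

rm -f "$dummy"
```

A line is printed once even if it contains more than one of the words, as the line with jkl and mno shows.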

grep extended regex (search all links with linked text from an html file)

To extract all links from an html file named html_file, use the following command.
#grep -Eoi '<a[^>]+>.*</a>' html_file
Let's look at this command in detail.
grep command options
-E: - This option tells the grep command that the search pattern contains the Meta characters which were added later.
-o: - By default, grep prints the entire line which contains the search pattern. This option instructs the grep command to print only the matching part of the line instead of the entire line.
-i: - This option asks the grep command to ignore case while matching the pattern.
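To see what each option contributes, here is a small sketch that feeds one made-up line of HTML (written in upper case on purpose) to grep with different option combinations. The sample line and the variable names are assumptions for this illustration only.

```shell
# Hypothetical input line with an upper-case anchor tag.
line='Visit <A HREF="http://example.com">Example</A> today'

# -E and -i only: the whole matching line is printed.
whole_line=$(printf '%s\n' "$line" | grep -Ei '<a[^>]+>')

# Adding -o: only the part of the line that matched is printed.
only_match=$(printf '%s\n' "$line" | grep -Eoi '<a[^>]+>')

# Dropping -i: the lower-case pattern no longer matches <A ...>,
# so grep prints nothing and exits non-zero (hence "|| true").
case_sensitive=$(printf '%s\n' "$line" | grep -Eo '<a[^>]+>' || true)

echo "with -Ei:  $whole_line"
echo "with -Eoi: $only_match"
echo "with -Eo:  $case_sensitive"
```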
Extended regular expression
<a :- Starting point of anchor tag.
[^>] :- Match any single character except >.
+ :- Match the preceding element one or more times.
> :- Ending point of anchor tag.
So far, the search string matches text which starts with <a, continues with one or more characters other than the > sign, and ends with the > sign. The > sign is excluded because it ends the tag, and the Meta character + requires at least one character before it; otherwise the pattern would also match <a>, which is not a useful anchor tag.
The Meta character dot (.) represents any single character and the star (*) repeats the preceding element zero or more times. Used together, .* matches any run of characters on the line, here the text between the opening and closing anchor tags. Because grep takes the longest possible match, a line which contains several anchor tags is matched from the first <a to the last </a> on that line.
</a> :- This is the closing point of anchor tag.
Collectively, the search pattern says: search for a text string which starts with <a, contains some value, ends with >, then contains any value and ends with </a>.
In simpler terms: <a some value > any value </a>.
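The role of the + sign inside the opening tag can be checked in isolation. The sketch below, using three invented test strings, compares [^>]+ (one or more characters) with [^>]* (zero or more characters); the second variant is not part of the command above and is shown only for contrast.

```shell
# Three hypothetical tags, one per line.
tags=$(printf '%s\n' '<a>' '<a href="x">' '<b>')

# + needs at least one character between <a and >, so the bare <a> is skipped.
plus_matches=$(printf '%s\n' "$tags" | grep -Eo '<a[^>]+>')

# * also accepts zero characters, so the bare <a> is matched as well.
star_matches=$(printf '%s\n' "$tags" | grep -Eo '<a[^>]*>')

echo "with +:"
printf '%s\n' "$plus_matches"
echo "with *:"
printf '%s\n' "$star_matches"
```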
The following figure illustrates the use of the above regex.
[Figure: search all anchor tag in html file]
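For readers who want to try the command on known input, here is a minimal sketch that writes a three-line sample page to a temporary file (the page contents and URLs are made up) and runs the same pattern on it.

```shell
# Hypothetical html_file with one link, two links on one line, and no link.
html_file=$(mktemp)
printf '%s\n' \
  '<p>See <a href="http://example.com/docs">the docs</a> first.</p>' \
  '<p><a href="https://example.org/a">A</a> and <a href="https://example.org/b">B</a></p>' \
  '<p>No links here.</p>' > "$html_file"

# Print only the matched anchor elements, ignoring case.
links_with_text=$(grep -Eoi '<a[^>]+>.*</a>' "$html_file")

printf '%s\n' "$links_with_text"

rm -f "$html_file"
```

On this input grep prints two lines rather than three: the two anchors that share a line come out as one match, because .* runs up to the last </a> on that line.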

grep regex (print only anchor tag)

If you are interested only in the anchor tags, leave out the part of the expression which matches the linked text, as follows.
#grep -Eoi '<a[^>]+>' html_file
[Figure: grep show only href tag]
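A short sketch of this variant on invented input is given below. It also includes an abbr tag to point out a limitation worth keeping in mind: the pattern only checks that the tag name starts with a, so other tags beginning with that letter are matched too.

```shell
# Hypothetical page fragment; the <abbr> tag is included on purpose.
html=$(printf '%s\n' \
  '<p><a href="https://example.org/a">A</a> and <a href="https://example.org/b">B</a></p>' \
  '<p><abbr title="HyperText Markup Language">HTML</abbr></p>')

# Each opening tag is now a separate match, even when two share a line.
opening_tags=$(printf '%s\n' "$html" | grep -Eoi '<a[^>]+>')

printf '%s\n' "$opening_tags"
```

If that matters, one possible refinement is to require whitespace after the tag name, for example '<a[[:space:]][^>]*>'; this is a suggestion, not part of the original command.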

grep regex (extract all links or URLs from an html file and save them in a text file)

To extract all links or URLs from an html file and save them in a text file, we have to combine three commands. These commands are: -
#grep -Eoi '<a[^>]+>' html_file
#grep -Eo 'href="[^"]+"'
#grep -Eo '(http|https)://[^/"]+' > link-only
We have to combine these commands in the following way.
#grep -Eoi '<a[^>]+>' html_file | grep -Eo 'href="[^"]+"' | grep -Eo '(http|https)://[^/"]+' > link-only
In the above commands,
  • The first command receives its input from the file named html_file, the second command receives its input from the first command and the third command receives its input from the second command.
  • The first command extracts all anchor tags from the html file and sends its output to the second command instead of printing it at the command prompt.
  • The second command extracts all href attributes from the output of the first command and sends its output to the third command.
  • The third command extracts the link from each href attribute in the output of the second command and saves its output to a text file named link-only. Because [^/"]+ stops at the first / after ://, only the scheme and host part of each link is saved.
The following figure shows the above commands with their output.
[Figure: grep save all links from a html file]
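The whole pipeline can be tried on known input as sketched below. The sample page, the temporary directory and the second output file link-full are assumptions added for illustration; the last command swaps in a slightly different final pattern, [^"]+ instead of [^/"]+, as one possible way to keep the full URL rather than only the scheme and host.

```shell
# Work in a temporary directory so no existing files are overwritten.
workdir=$(mktemp -d)
html_file="$workdir/html_file"
printf '%s\n' \
  '<p>See <a href="http://example.com/docs">the docs</a> first.</p>' \
  '<p><a href="https://example.org/a">A</a> and <a href="https://example.org/b">B</a></p>' \
  > "$html_file"

# Anchor tags -> href attributes -> scheme and host, saved to link-only.
grep -Eoi '<a[^>]+>' "$html_file" \
  | grep -Eo 'href="[^"]+"' \
  | grep -Eo '(http|https)://[^/"]+' > "$workdir/link-only"

# Variation (not from the original command): keep everything up to the closing quote.
grep -Eoi '<a[^>]+>' "$html_file" \
  | grep -Eo 'href="[^"]+"' \
  | grep -Eo 'https?://[^"]+' > "$workdir/link-full"

hosts=$(cat "$workdir/link-only")
full_urls=$(cat "$workdir/link-full")

echo "link-only:"
printf '%s\n' "$hosts"
echo "link-full:"
printf '%s\n' "$full_urls"

rm -rf "$workdir"
```

Note that the second and third commands do not use -i, so an attribute written as HREF or a scheme written as HTTP would be skipped unless -i is added to them as well.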
