Thursday, 1 August 2019

Working with grep and sed

In this section, you'll learn about two important utilities used for working with files and output. The grep family of utilities is used to find values, and the sed utility can change the value of strings. Many of the utilities discussed previously will be used in the examples in this section.

The grep Family
The utility with the funny name (something common in Linux) takes its name from the function that it performs: the ed editor command g/re/p, which stands for "globally search for a regular expression and print" the matching lines. In layman's terms, it is one of the most advanced search utilities you can use. In order to be proficient with it, however, you must understand what a regular expression is and how to use it in searching for matches to a query.

There are a number of rules for regular expressions, and these eight constitute the most important:

Any non-special character will be equal to itself.

Any special character will be equal to itself if preceded by a backslash (\).

The beginning of the line can be signified by an up caret (^), and the end of the line by a dollar sign ($).

A range can be expressed within brackets ([]).

An up caret (^) as the first character of a bracket will find values not matching entries within the brackets.

A period (.) can signify any single character.

An asterisk (*) stands for zero or more occurrences of the character that precedes it.

Quotation marks are not always needed around search strings, but can be needed, and should be used as a general rule.

Table 3.4 offers some examples and elaboration on each of the preceding rules.

Using Regular Expressions

Rule  Characters   Search Result
1     c            Matches "c" anywhere within the line (any character without a special purpose)
1     apple        Matches "apple" anywhere within the line
2     $            Every line (the end of every line matches)
2     \$           Every line that contains a dollar sign
3     ^c           Every line that begins with the character "c"
3     c$           Every line that ends with the character "c"
4     [apple]      Every line that has an "a", "p", "l", or "e" (because the brackets are interpreted as a set, the second occurrence of the "p" is ignored)
4     [a-z]        Any lowercase letter
4     [[:lower:]]  Any lowercase letter. Other classes that can be used include [[:alnum:]], [[:alpha:]], [[:digit:]], and [[:upper:]]
5     [^a-z]       Any character other than a lowercase letter
5     [^0-9]       Any character other than a digit
6     c.           The letter "c" followed by any single character
6     c..$         The letter "c" followed by any two characters at the end of the line
7     c*           Zero or more occurrences of "c"
8     "c*"         The same pattern, quoted so that the shell does not expand the asterisk
8     "c apple"    The letter "c" followed by a space and the word "apple"
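A few of these rules can be tried out directly. The following is a minimal sketch, assuming a POSIX shell and grep; the file name sample and its contents are made up for illustration and do not come from the text.

```shell
# Hypothetical sample file, created in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'cost: $5' 'ccc' 'a cat' 'Zebra 42' > "$dir/sample"

# Rule 3: ^ anchors the match at the start of the line.
starts_with_c=$(grep -c '^c' "$dir/sample")

# Rule 2: a backslash turns $ into a literal dollar sign.
has_dollar=$(grep '\$' "$dir/sample")

# Rule 4: a character class inside a bracket expression (note the double brackets).
has_digit=$(grep -c '[[:digit:]]' "$dir/sample")

# Rule 7: c* means "zero or more c", so it matches every line,
# including lines that contain no "c" at all.
star_matches=$(grep -c 'c*' "$dir/sample")

echo "$starts_with_c $has_digit $star_matches"
echo "$has_dollar"
```

The last count is the one that tends to surprise: because zero occurrences is a valid match, a pattern consisting only of c* never filters anything out.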


To illustrate some of these operations using grep, assume there is a small file named garbage with the following contents:

I heard about the cats last night
and particularly the one cat that
ran away with all the catnip
If you want to find all occurrences of the word cat, the syntax becomes

$ grep "cat" garbage
I heard about the cats last night
and particularly the one cat that
ran away with all the catnip
$
In this instance, the three-letter sequence "cat" appears in every line. Not only is there "cat" in the second line, but also "cats" in the first and "catnip" in the third, all matching the character sequence specified. If you are interested in "cat" but not "cats", the syntax becomes

$ grep "cat[^s]" garbage
and particularly the one cat that
ran away with all the catnip
$
This specifically removes a four-letter sequence of "cats" while finding all other three-letter sequences of "cat". If we truly were only interested in "cat" and no deviations thereof, there are a couple of other possibilities to explore. The first method would be to include a space at the end of the word and within the quotation mark:

$ grep "cat " garbage
and particularly the one cat that
$
This finds the four-character combination given: "cat" followed by a space. The only problem (and it is a big one) is that if the word is the last entry on a line, it is not followed by a space and so is not returned in the display. Another possibility is to eliminate "s" and "n" from the return display:

$ grep "cat[^sn]" garbage
and particularly the one cat that
$
This accomplishes the task, but would not exclude other words in which "cat" is followed by a letter other than "s" or "n". To eliminate every letter as the fourth character, it is better to use

$ grep "cat[^A-Za-z]" garbage
and particularly the one cat that
$
This excludes both the upper- and lowercase character sets as the fourth character.
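The bracket approach shares the end-of-line weakness of the trailing-space approach: a "cat" that ends a line has no fourth character to test. The sketch below, which assumes the -E (extended syntax) option described later in this section and a made-up file named pets, shows the difference.

```shell
dir=$(mktemp -d)
printf '%s\n' 'the cats' 'one cat that' 'catnip' 'my cat' > "$dir/pets"

# "cat" followed by a non-letter: misses "my cat", where nothing follows "cat".
with_bracket=$(grep -c 'cat[^A-Za-z]' "$dir/pets")

# Extended syntax: "cat" followed by a non-letter OR by the end of the line.
with_eol=$(grep -Ec 'cat([^A-Za-z]|$)' "$dir/pets")

echo "$with_bracket $with_eol"
```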

Options for grep
The default action for grep is to find the lines that match the specified pattern and print them. This can be useful for pulling out key data from a large display, for example, to find what port user karen is on:

NOTE

The who command, used in the following example, merely shows who is logged on to the system.

$ who | grep "karen"
karen  pts/2  Aug 21 13:42
$
From all the output of the who command, only those having the string "karen" are displayed. As useful as this is, there are times when the actual output is not as important as other items surrounding it. For example, if you don't care where karen is logged on, but want to know how many concurrent sessions she has, you can use the -c option:

$ who | grep -c "karen"
1
$
NOTE

You can also append a logical "or" (||) so that a message is printed when grep finds no match, meaning the user has not come in yet:

who | grep "karen" || echo "She has not come in yet"
The -c option is used to count the number of lines that would be displayed, but the display itself is suppressed. The -c option is basically performing the same task as this:

$ who | grep "karen" | wc -l
1
$
But the -c option is quicker and more efficient because it removes an additional operation from the process. Other options that can be used with grep include the following:

-f uses a file to find the strings to search for.

-H prefixes each matching line with the name of the file the match came from, whereas -h prevents the filename from appearing. When only one file (or standard input) is searched, -h is the default; when several files are searched, -H is.

-i to ignore case. If this option did not exist, you would conceivably have to search a text file for "karen", "Karen", "KAREN", and all deviations thereof to find all matches.

-L prints filenames that do not contain matches, whereas -l prints filenames that do contain matches.

-n to show line numbers in the display. This differs from numbering the lines (which nl can do) because the numbers that appear denote the numbering of the lines in the existing file, not the output.

-q quiets all output and is usually just used for testing return conditions.

-s prevents any errors that occur from displaying error messages. This is useful if you do not want errors stating that you have inadequate permissions to view something when scanning all directories for a match.

-v serves as the "not" option. It displays the lines that do not match instead of those that do. For example, who | grep -v karen will show all users who are not karen.

-w forces the found display to match the whole word. This provides the best solution to the earlier discussion of finding "cat" but no derivatives thereof.

-x forces the found display to match the whole line.

The options can be mixed and matched, as desired, as long as one parameter does not cancel another. For example, it is possible to search for whole words and include header information by using either -wH or -Hw. There is no point, however, in asking for line numbers together with a final count (-nc): the count suppresses the normal display, so the line numbers never appear.
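As a quick, self-contained illustration of several of these options, the sketch below recreates the garbage file from earlier in a temporary directory (the directory is an assumption of the example) and captures what each option returns.

```shell
dir=$(mktemp -d)
printf '%s\n' 'I heard about the cats last night' \
              'and particularly the one cat that' \
              'ran away with all the catnip' > "$dir/garbage"

whole_word=$(grep -w 'cat' "$dir/garbage")    # only the line with the word "cat"
numbered=$(grep -n 'catnip' "$dir/garbage")   # prefixed with its line number in the file
inverted=$(grep -vc 'cats' "$dir/garbage")    # count of lines NOT containing "cats"
ignore_case=$(grep -ic 'CAT' "$dir/garbage")  # case-insensitive count

# -q prints nothing; only the exit status says whether a match was found.
if grep -q 'dog' "$dir/garbage"; then found=yes; else found=no; fi

echo "$whole_word"
echo "$numbered"
echo "$inverted $ignore_case $found"
```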

Some examples of how these options can be used follow. For the first, assume that we want to get a long list (ls -l) of the subdirectories beneath the current directory, and have no interest in actual filenames. Within the output of ls -l, the first field shows the permissions of the entry. If the entry is a file, the first character is "-", and if it is a directory, it is "d". Thus the command would be

ls -l | grep "^d"
If you want to know how many words are in the spelling dictionary, you can find out in a number of ways, including

wc -l /usr/share/dict/words
or

grep -c "." /usr/share/dict/words
Both of these generate a number based on the number of lines within the file. Suppose, however, you want to find out only how many words there are that start with the letter "c" (upper- or lowercase):

grep -ic "^c" /usr/share/dict/words
Or you want to count the words that begin with a letter in the last half of the alphabet:

grep -ic "^[n-z]" /usr/share/dict/words
NOTE

As long as every line in the file begins with a letter, the preceding example could also be expressed as

grep -ci "^[^a-m]" /usr/share/dict/words
or

grep -vci "^[a-m]" /usr/share/dict/words
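The difference between these forms shows up only when a line starts with something other than a letter. The sketch below uses a tiny made-up word list as a stand-in for /usr/share/dict/words and sets LC_ALL=C (an assumption of the example) so that the bracket ranges follow plain ASCII order.

```shell
LC_ALL=C; export LC_ALL
dir=$(mktemp -d)
# Hypothetical stand-in for the dictionary file; note the entry starting with a digit.
printf '%s\n' 'Cat' 'cot' 'apple' 'Nile' 'zebra' '3M' > "$dir/words"

c_words=$(grep -ic '^c' "$dir/words")         # begins with c or C
last_half=$(grep -ic '^[n-z]' "$dir/words")   # begins with n-z, either case
not_first=$(grep -ci '^[^a-m]' "$dir/words")  # begins with anything outside a-m
inverted=$(grep -vci '^[a-m]' "$dir/words")   # lines that do not begin with a-m

echo "$c_words $last_half $not_first $inverted"
```

The last two counts come out one higher than the second because "3M" begins with a character that is neither in a-m nor in n-z.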
Suppose you have a number of different strings you want to find, not just one. You can search for them individually, requiring a number of operations, or you can place the search criteria in a text file, and input it into grep using the -f option. The following example assumes that the file wishlist already exists:

$ cat wishlist
dog
cat
fish
$
$ grep -if wishlist /usr/share/dict/words
Approximately 450 lines are displayed, covering every word that contains any of the three strings. You can also supply multiple search patterns by leaving a quotation mark open, which puts the shell into input mode (PS2 prompt) so that the patterns continue on the following lines:

$ grep -ic "dog
> cat
> fish" /usr/share/dict/words
457
$
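Both techniques can be checked against a small list where the answer is easy to verify by eye. This is a sketch only; the five-word list below is invented for the example.

```shell
dir=$(mktemp -d)
printf '%s\n' 'dog' 'cat' 'fish' > "$dir/wishlist"
printf '%s\n' 'bulldog' 'Catalog' 'bird' 'goldfish' 'horse' > "$dir/words"

# One pattern per line in the file named by -f; a line is shown if ANY pattern matches.
from_file=$(grep -if "$dir/wishlist" "$dir/words")

# The same three patterns given directly, separated by newlines inside the quotes.
inline_count=$(grep -ic 'dog
cat
fish' "$dir/words")

echo "$from_file"
echo "$inline_count"
```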
fgrep
The first attempt to greatly enhance grep was fgrep, as in either "fixed grep" or "fast grep." This utility was created in the days of Unix, prior to Linux. It enhanced grep by adding the ability to search for more than one item at a time (something that grep has since gained with the -f option). The tradeoff for gaining the ability to search for multiple items was an inability to use regular expressions, never mind what the acronym stood for.

Now that grep itself can search for multiple items, and given that fgrep cannot use regular expressions, fgrep still exists but is rarely used in place of grep. In fact, one of the options added to grep is -F, which forces it to interpret the pattern as a simple string and work the same way fgrep would (the default action is -G for basic regular expressions).

NOTE

For most practical purposes, grep -F is identical to fgrep.
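The practical effect of -F is that characters such as the period and asterisk lose their special meaning. A minimal sketch, using a made-up three-line file:

```shell
dir=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'a*c' > "$dir/files"

regex_hits=$(grep -c 'a.c' "$dir/files")   # "." matches any character: all three lines
fixed_hits=$(grep -Fc 'a.c' "$dir/files")  # -F: only the literal string "a.c"
star_hit=$(grep -F 'a*c' "$dir/files")     # "*" is taken literally as well

echo "$regex_hits $fixed_hits $star_hit"
```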

egrep
The second attempt to enhance grep was egrep—as in "extended grep." This utility combined the features of grep with those of fgrep—keeping the use of regular expressions. You can specify multiple values that you want to look for within a file (-f), on separate lines (using uneven numbers of quotation marks), or by separating with the pipe symbol (|). For example

$ cat names
Jan May
Bob Mays
Shannon Harris
Richard Harriss
William Harrisson
Jim Buck
$
$ egrep "Jim|Jan" names
Jan May
Jim Buck
$
It also added two new operators, each applying to the item that precedes it:

? to mean zero or one occurrence

+ to mean one or more occurrences

Assume you can't recall how Jan spells her last name—is it May or Mays? Not knowing if there really is an "s" on the end, you can ask for zero or one occurrences of this character:

$ egrep "Mays?" names
Jan May
Bob Mays
$
Even though there was no "s", Jan's entry was found—as were those that contained the character in question. With the plus sign (+), you know that at least one iteration of the character exists, but are not certain if there are more. For example, does Shannon spell her last name with one "s" or two?

$ egrep "Harris+" names
Shannon Harris
Richard Harriss
William Harrisson
$
To look for values of greater length than one character, you can enclose text within parentheses "( )". For example, if you want to find Harriss and Harrisson (not Harris), the "on" becomes an entity with zero or one occurrences:

$ egrep "Harriss(on)?" names
Richard Harriss
William Harrisson
$
Since the creation of egrep—again in the days of Unix—most of the features have been worked into the version of grep included with Linux. Using the -E option with grep, you can get most of the functionality of egrep.

NOTE

For most practical purposes, grep -E is identical to egrep.
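The egrep examples above can be rerun with grep -E. The sketch below recreates the names file in a temporary directory (an assumption of the example) and adds an end-of-line anchor to the last pattern, which is not part of the original example.

```shell
dir=$(mktemp -d)
printf '%s\n' 'Jan May' 'Bob Mays' 'Shannon Harris' 'Richard Harriss' \
              'William Harrisson' 'Jim Buck' > "$dir/names"

either=$(grep -Ec 'Jim|Jan' "$dir/names")        # alternation
optional_s=$(grep -Ec 'Mays?' "$dir/names")      # zero or one "s"
one_or_more=$(grep -Ec 'Harris+' "$dir/names")   # one or more "s"
grouped=$(grep -E 'Harriss(on)?$' "$dir/names")  # optional group, anchored at end of line

echo "$either $optional_s $one_or_more"
echo "$grouped"
```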

sed
The name sed is short for "stream editor," and the description is an accurate one. The utility accepts input (text) that flows through it like a river or stream. It edits that text and makes changes to it based on the parameters it has been given. The syntax for sed is

sed {options} {commands} filename
The number of options is very limited, but the commands are more numerous. These are the accepted options:

-e to specify a command to execute

-f to specify a file in which to find commands to execute

-n to run in quiet mode

As mentioned, the commands are plentiful and the best way to understand them is to examine a few examples. One of the simplest commands to give it, or any editor, is to substitute one string for another. This is accomplished with the s (for substitute) command, followed by a slash, the string to find, a slash, the string to replace the first value with, and another slash. The entire entry should be placed within single quotes:

's/old value/new value/'
Thus an example would be

$ echo I think the new neighbors are getting a new air pump |
  sed 's/air/heat/'
And the output displayed becomes

I think the new neighbors are getting a new heat pump
The string "air" was replaced by the string "heat". Text substitution is one of the simplest features of sed, and also what it is used for most of the time, for it is here that its power and features truly excel.

NOTE

It is important to recognize that sed is not an interactive editor. When you give it the commands to execute, it does so without further action on the user's part.

By default, sed works by scanning each line in successive order. If it does not find a match after searching the entire line, it moves to the next. If it finds a match for the search text within the line, it makes the change and immediately moves on to examine the next line. This can provide some unexpected results. For example, suppose you want to change the word "new" to "old":

$ echo I think the new neighbors are getting a new air pump |
  sed 's/new/old/'
The output becomes

I think the old neighbors are getting a new air pump
The second occurrence of the search phrase is not changed because sed finished with the line after finding the first match. If you want all occurrences within the line to be changed, you must make the search global by using a "g" at the end of the command:

$ echo I think the new neighbors are getting a new air pump |
  sed 's/new/old/g'
The output becomes

I think the old neighbors are getting a old air pump
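The variants can be compared side by side. The sketch below also uses a numeric flag, which is not covered in the text but is part of standard sed: a number after the final slash replaces only that occurrence.

```shell
line='I think the new neighbors are getting a new air pump'

first=$(echo "$line" | sed 's/new/old/')    # first occurrence only
all=$(echo "$line" | sed 's/new/old/g')     # every occurrence on the line
second=$(echo "$line" | sed 's/new/old/2')  # only the second occurrence

echo "$first"
echo "$all"
echo "$second"
```

Note that sed changes only what it is told to change: the article in front of the second "new" stays "a", so the global result reads "a old air pump".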
The global parameter makes sed continue to search the remainder of the line, after finding a match, until the end of the line is reached. This handles multiple occurrences of the same phrase, but you must go in a different direction if you want to change more than one value.

This can be accomplished on the command line with the -e option used to flag every command. For example

$ echo this line is shorter |
  sed -e 's/line/phrase/' -e 's/short/long/'
The output becomes

this phrase is longer
The second method is to place all the substitution criteria in a file and then summon the file with the -f option. The following example assumes the file wishlist2 already exists:

$ cat wishlist2
s/line/phrase/
s/short/long/
$
$ echo this line is shorter | sed -f wishlist2
this phrase is longer
$
Both produce the same results, but using a file to hold your commands makes it much easier to change criteria if you need to, keep track of the operations you are performing, and rerun the commands at a later time.

NOTE

It is important to understand that when there are multiple operations, sed applies them in order to each line: it reads a line, runs the first command on it, runs the second command on the result, and so on, and only then moves to the next line. Each command therefore sees the changes made by the commands before it.
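The way sed sequences multiple commands can be observed directly. In this minimal sketch (the two-line input is invented), the first command prints each line as read and the second appends "!" and prints the result; the interleaved output shows that both commands run on one line before the next line is read.

```shell
# -n suppresses automatic printing, so only the explicit p output appears.
out=$(printf '%s\n' 'a' 'b' | sed -n -e 'p' -e 's/$/!/p')
echo "$out"
```

If sed finished the first command for the whole input before starting the second, the output would be a, b, a!, b! rather than a, a!, b, b!.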

Technically (but this is not recommended), you could also complete the same operation by using semicolons after each substitution, or by leaving the single quote open so that the shell goes into input mode:

$ echo this line is shorter | sed 's/line/phrase/;
  s/short/long/'
or

$ echo this line is shorter | sed '
> s/line/phrase/
> s/short/long/'
Restricting Lines
If you have a large file, you may not want to edit the entire file, but only selected lines. As with everything, there are a number of ways you can go about addressing this issue. The first method would be to specify a search clause and only make substitutions when the search phrase is found. The search phrase must precede the operation and be enclosed within slashes:

/search phrase/ {operation}
For example, assume you have the following file named tuesday:

one 1
two 1
three 1
three 1
two 1
one 1
You want the end result displayed to be

one 1
two 2
three 3
three 3
two 2
one 1
In this case, the search should be for the word two. If it is found, any 1s should be converted to 2. Likewise, there should be a search for the word three with any 1s following converted to 3:

$ cat wishlist3
/two/ s/1/2/
/three/ s/1/3/
$
$ sed -f wishlist3 tuesday
one 1
two 2
three 3
three 3
two 2
one 1
$
NOTE

By design, sed changes only the values displayed, not the values in the original file. To keep the results of the change, you must redirect output to a file, like this

sed -f wishlist3 tuesday > wednesday
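The following sketch puts the search-phrase addresses and the redirection together. It recreates the tuesday file in a temporary directory (an assumption of the example) and passes the two commands with -e instead of a script file.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one 1' 'two 1' 'three 1' 'three 1' 'two 1' 'one 1' > "$dir/tuesday"

# Each substitution applies only on lines matching its search phrase.
sed -e '/two/ s/1/2/' -e '/three/ s/1/3/' "$dir/tuesday" > "$dir/wednesday"

original_line2=$(sed -n '2p' "$dir/tuesday")    # the input file is untouched
edited_line2=$(sed -n '2p' "$dir/wednesday")    # the redirected output holds the edit

echo "$original_line2 / $edited_line2"
```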
Sometimes the actions specified should take place not on the basis of a search, but purely on the basis of the line number. For example, you may only want to change the first line of a file, the first 10 lines, or any other combination. When that is the case, you can specify a range of lines using this syntax:

First line,last line
To change every "two" and "three" to "one" in the first five lines, use this example:

$ cat wishlist4
1,5 s/two/one/
1,5 s/three/one/
$
$ sed -f wishlist4 tuesday
one 1
one 1
one 1
one 1
one 1
one 1
$
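Line ranges can be sketched in the same way. In the example below, the replacement "X" is arbitrary, and the second command uses $ as an address meaning the last line, so 4,$ covers line 4 through the end of the file.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one 1' 'two 1' 'three 1' 'three 1' 'two 1' 'one 1' > "$dir/tuesday"

first_three=$(sed '1,3 s/1/X/' "$dir/tuesday")  # only lines 1 through 3 are edited
from_four=$(sed '4,$ s/1/X/' "$dir/tuesday")    # line 4 through the last line

echo "$first_three"
echo "$from_four"
```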
Printing
The default operation for sed is simply to print: unless told otherwise, it displays every line it reads, whether or not any editing occurred. The one caveat is that sed must always be given a command. If you simply gave the name of a file following sed and nothing more, it would misinterpret the filename as a command.

You can give a number of lines for it to display, however, followed with the q command (for quit), and see the contents of any file. For example

sed 75000q /usr/share/dict/words
will display all of the words file, provided the file contains fewer than 75,000 lines (word lists on some systems are considerably larger). If you only want to see the first 60, you can use

sed 60q /usr/share/dict/words
The antithesis, or opposite, of the default action is the -n option. This prevents lines from displaying that normally would. For example

sed -n 75000q /usr/share/dict/words
will display nothing. Although it may seem foolish for a tool designed to display data to have an option that prevents displaying data, it can actually be a marvelous thing. To put it into perspective, though, you must know that the p command exists and is used to force a print (the default action).

In an earlier example, the file tuesday had numbers changed to match their alpha counterpart. Some of the lines in the file were already correct, and did not need to be changed. If the lines aren't being changed, do you really need to see their output—or are you only interested in the lines being changed? If the latter is the case, the following will display only those entries:

$ cat wishlist5
/two/ s/1/2/p
/three/ s/1/3/p
$
$ sed -nf wishlist5 tuesday
two 2
three 3
three 3
two 2
$
To print only lines 200 to 300 of a file (no actual editing taking place), the command becomes

sed -n '200,300p' /usr/share/dict/words
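The interaction between q, -n, and p is easiest to see on a short file. A minimal sketch, using a made-up five-line list:

```shell
dir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' 'delta' 'epsilon' > "$dir/list"

head2=$(sed 2q "$dir/list")                          # print up to line 2, then quit
middle=$(sed -n '2,4p' "$dir/list")                  # -n plus p: only lines 2 through 4
beta_copies=$(sed '2,4p' "$dir/list" | grep -c 'beta')  # without -n, printed twice

echo "$head2"
echo "$middle"
echo "$beta_copies"
```

The last value shows why -n matters: without it, each line in the range appears once from the default print and once more from p.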
Deleting
The d command is used to specify deletion. As with all sed operations, the line is deleted from the display only, and not from the original file. The syntax is always

{the specification} d
Use this example to delete all lines that have the string "three":

$ sed '/three/ d' tuesday
one 1
two 1
two 1
one 1
$
To delete lines 1 through 3, try this:

$ sed '1,3 d' tuesday
three 1
two 1
one 1
$
NOTE

In all cases, the d command forces the deletion of the entire line. You cannot use this command to delete a portion of a line or a word. If you want to delete a word, substitute nothing for the word:

sed 's/word//'
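The three kinds of deletion can be sketched together. The example recreates the tuesday file in a temporary directory; the sentence used for the word deletion is invented.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one 1' 'two 1' 'three 1' 'three 1' 'two 1' 'one 1' > "$dir/tuesday"

no_three=$(sed '/three/d' "$dir/tuesday")  # drop every line containing "three"
tail_only=$(sed '1,3d' "$dir/tuesday")     # drop lines 1 through 3
# d always removes whole lines; to remove a word, substitute nothing for it.
no_word=$(echo 'this is a test' | sed 's/a //')

echo "$no_three"
echo "$tail_only"
echo "$no_word"
```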
Appending Text
Strings can be appended to a display using the a command. Append always places the string at the end of the specification—after the other lines. For example, to place a string equal to "The End" as the last line of the display, the command would be

$ sed '$a\
The End' tuesday
one 1
two 1
three 1
three 1
two 1
one 1
The End
$
The dollar sign ($) is used to specify the last line of the file, while the backslash (\) after the a command tells sed that the text to append begins on the next line. If the dollar sign is left out of the command, the text is appended after every line:

$ sed 'a\
The End' tuesday
one 1
The End
two 1
The End
three 1
The End
three 1
The End
two 1
The End
one 1
The End
$
If you want the string appended following a certain line number, add the line number to the command:

$ sed '3a\
Not The End' tuesday
one 1
two 1
three 1
Not The End
three 1
two 1
one 1
$
In place of the append, you can use the insert (i) command to place strings before the given line versus after:

$ sed '3i\
Not The End' tuesday
one 1
two 1
Not The End
three 1
three 1
two 1
one 1
$
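Append and insert also accept a search phrase as the address, not just a line number or the dollar sign. The sketch below assumes a three-line file named short, invented for the example, and keeps the appended text at the start of its own line.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one 1' 'two 1' 'three 1' > "$dir/short"

# Append after the last line.
appended=$(sed '$a\
The End' "$dir/short")

# Insert before every line that contains "two".
inserted=$(sed '/two/i\
before two' "$dir/short")

echo "$appended"
echo "$inserted"
```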
Other Commands
There are a few other commands that can be used with sed to add additional functionality. The following list shows some of these commands:

b can be used to branch to a label elsewhere in the script, skipping the commands in between.

c is used to change one line to another. Whereas "s" will substitute one string for another, "c" changes the entire line:

$ sed '/three/ c\
four 4' tuesday
one 1
two 1
four 4
four 4
two 1
one 1
$
r can be used to read an additional file and append text from a second file into the display of the first.

w is used to write to a file. Instead of using the redirection (> filename), you can use w filename:

$ cat wishlist6
/two/ s/two/one/
/three/ s/three/one/
w friday
$
$ sed -f wishlist6 tuesday
one 1
one 1
one 1
one 1
one 1
one 1
$
$ cat friday
one 1
one 1
one 1
one 1
one 1
one 1
$
y is used to translate one set of characters into another, character by character (much as the tr utility does).
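A short sketch of the y, r, and w commands follows. The file names extra and t-lines are hypothetical, and the example assumes the temporary directory path contains no spaces, because r and w treat the rest of the line as the file name.

```shell
dir=$(mktemp -d)

# y: translate each lowercase letter to the uppercase letter in the same position.
upper=$(echo 'one two' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# r: read another file and append its contents after line 1.
printf '%s\n' 'inserted' > "$dir/extra"
with_extra=$(printf '%s\n' 'first' 'second' | sed "1r $dir/extra")

# w: write the lines that contain "t" to a file; -n keeps them off the display.
printf '%s\n' 'one 1' 'two 1' 'three 1' | sed -n "/t/w $dir/t-lines"
written=$(cat "$dir/t-lines")

echo "$upper"
echo "$with_extra"
echo "$written"
```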

A Word to the Wise
When working with sed and performing more than one change, substitution, or deletion operation, you must be very careful about the order that you specify operations to take place in. Using the example of the tuesday file used earlier, assume that the numbering is off. You want to change all ones to two, all twos to three, and all threes to four. Here's one way to approach it, with the results it generates:

$ cat wishlist7
s/one/two/
s/two/three/
s/three/four/
$
$ sed -f wishlist7 tuesday
four 1
four 1
four 1
four 1
four 1
four 1
$
The results are neither what was expected nor what was wanted. Because sed applies the commands in order, each command sees the result of the one before it. The first operation on its own leaves the lines reading

two 1
two 1
three 1
three 1
two 1
two 1
The second operation then works on that already-changed text, which becomes

three 1
three 1
three 1
three 1
three 1
three 1
By the time the third operation runs, every line reads the same and every line is changed, completely defeating the purpose. This is one of the reasons that sed's changing the display only and not the original file is a blessing.

In order to solve the problem, you need only realize what is transpiring, and arrange the order of operations to prevent this from happening:

$ cat wishlist8
s/three/four/
s/two/three/
s/one/two/
$
$ sed -f wishlist8 tuesday
two 1
three 1
four 1
four 1
three 1
two 1
$
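Reordering is one fix. Another possibility, sketched below, uses the t command, which is not covered in the text: t branches to the end of the script when a substitution has succeeded on the current line, so no later command can touch text that has already been changed. With that in place, the commands can stay in their original order.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one 1' 'two 1' 'three 1' 'three 1' 'two 1' 'one 1' > "$dir/tuesday"

# After each substitution, t skips the rest of the script if that substitution matched.
renumbered=$(sed -e 's/one/two/' -e t \
                 -e 's/two/three/' -e t \
                 -e 's/three/four/' "$dir/tuesday")

echo "$renumbered"
```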
Other Useful Utilities
A number of other useful text utilities are included with Linux. Some of these have limited usefulness and are intended only for a specific purpose, but knowing of their existence and purpose can make your life with Linux considerably easier.

In alphabetical order, the additional utilities are as follows:

awk: Often used in conjunction with sed, awk is a utility/language/tool with an enormous amount of flexibility for manipulating text. The GNU implementation of the utility is called gawk (as in GNU awk) and, for most purposes, works the same way.

cmp: Compares two files and reports whether they are the same or, if not, where the first difference occurs in terms of byte and line number.

comm: Compares two sorted files line by line and reports the lines unique to each and the lines they have in common.

diff: Allows you to see the differences between two files.

diff3: Similar to diff, but compares three files.

perl: A programming language useful for working with text.

regexp: A utility that can compare a regular expression against a string to see if there is a match. If there is a match, a value of 1 is returned and if there is no match, a value of 0 is returned.
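As a brief taste of two of these utilities, the sketch below compares two small, already sorted files; the file names a and b and their contents are invented for the example.

```shell
dir=$(mktemp -d)
printf '%s\n' 'apple' 'banana' 'cherry' > "$dir/a"
printf '%s\n' 'banana' 'cherry' 'date' > "$dir/b"

# comm needs sorted input; -12 hides columns 1 and 2, leaving only the common lines.
common=$(comm -12 "$dir/a" "$dir/b")
# -23 hides columns 2 and 3, leaving the lines found only in the first file.
only_a=$(comm -23 "$dir/a" "$dir/b")

# cmp -s prints nothing; its exit status says whether the files are identical.
if cmp -s "$dir/a" "$dir/b"; then same=yes; else same=no; fi

echo "$common"
echo "$only_a $same"
```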
