Searching Text File Contents using grep
In an earlier post, we saw how the wc program can be used to count the characters, words, and lines in text files. In this post, we introduce the grep program, a handy tool for searching text file contents for specific words or character sequences. The name grep comes from the ed editor command g/re/p, which stands for “globally search for a regular expression and print”. What, you may well ask, is a regular expression, and why on earth should I want to search for one? We will provide a more formal definition of regular expressions in a later post, but for now, it is enough to know that a regular expression is simply a way of describing a pattern, or template, to match some sequence of characters. A simple regular expression would be “Hello”, which matches exactly five characters: “H”, “e”, two consecutive “l” characters, and a final “o”. More powerful search patterns are possible, and we shall examine them in a later post.
The syntax below gives the general form of the grep command line:
# [grep|fgrep|egrep] [-i] [-n] [-v] [-r] [-w] pattern [filename ...]
There are actually three different names for the grep tool:
- fgrep: Does a fast search for fixed strings. The pattern is taken literally, with no wildcard characters, which is useful when searching for an ordinary word.
- grep: Pattern searches using ordinary regular expressions.
- egrep: Pattern searches using more powerful extended regular expressions.
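The difference between a fixed-string search and a regular-expression search is easiest to see with a pattern that contains a dot. The sketch below uses the POSIX spellings grep -F and grep -E, which behave like fgrep and egrep; recent GNU grep releases print a warning that the fgrep and egrep names are obsolescent. The file name notes and its contents are made up for illustration.

```shell
# Work in a scratch directory so nothing on the system is touched.
tmp=$(mktemp -d)
printf '%s\n' 'version 1.5' 'version 1x5' > "$tmp/notes"

# As a regular expression, "." matches any single character: both lines match.
regex_hits=$(grep -c '1.5' "$tmp/notes")

# As a fixed string (fgrep behaviour), "." is only a literal dot: one line matches.
fixed_hits=$(grep -F -c '1.5' "$tmp/notes")

echo "regex: $regex_hits  fixed: $fixed_hits"
rm -r "$tmp"
```

When the search term is an ordinary word the two forms find the same lines, so the choice only matters once the pattern contains characters that are special in regular expressions.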
The pattern argument supplies the template characters for which grep is to search. The pattern is expected to be a single argument, so if the pattern contains any spaces, or other characters special to the shell, you must enclose it in quotes to prevent the shell from expanding or word-splitting it. The following table summarizes some of grep’s more commonly used command line switches. Consult the grep man page (or invoke grep --help) for more information.
Common Command Line Switches for the grep Command
|-c||Print a count of matching lines only.|
|-h||Suppress filename prefixes.|
|-e expression||Use expression as a search pattern. (Helpful for specifying several alternate patterns.)|
|-i||Ignore case when determining matches.|
|-l||Print only the names of files that contain a match.|
|-n||Include line numbers along with matching lines.|
|-q||“Quiet”. Do not write anything to standard out. Instead, exit with a zero exit status if any match is found.|
|-r||Search all files, recursing through directories.|
|-w||Only match whole words.|
|-C num||Include num lines of context before and after each matched line (for example, -C 2).|
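A few of these switches are easier to appreciate side by side. The following is a minimal sketch, assuming a throwaway file named fruit whose contents are invented for the example; it exercises -c, repeated -e, and -q.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'red apple' 'green pear' 'red cherry' 'yellow banana' > "$tmp/fruit"

# -c prints how many lines match instead of the lines themselves.
count=$(grep -c red "$tmp/fruit")

# -e may be repeated; a line is printed if any one of the patterns matches.
either=$(grep -e green -e yellow "$tmp/fruit")

# -q prints nothing; the exit status alone reports whether a match exists.
if grep -q cherry "$tmp/fruit"; then found=yes; else found=no; fi

echo "count=$count found=$found"
echo "$either"
rm -r "$tmp"
```

The -q form is the one to reach for inside shell scripts, where only the yes-or-no answer matters.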
Show All Occurrences of a String in a File
Under Linux, there are often several ways of accomplishing the same task. For example, to see if a file contains the word “even”, you could just visually scan the file:
$ cat file
This file has some words.
It also has even more words.
Reading the file, we see that it does indeed contain the letters “even”. This method breaks down on a large file, where we could easily miss one word among several thousand, or even several hundred thousand. We can let the grep tool search through the file for us instead:
$ grep even file
It also has even more words.
Here we searched for a word using its exact spelling. Instead of just a literal string, the pattern argument can also be a general template for matching more complicated character sequences; we shall explore that in some other post.
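A literal search works just as well for a phrase, provided the pattern is quoted so that it reaches grep as a single argument. The sketch below assumes a scratch file named claims with made-up contents.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'No cat has nine tails.' 'Every cat has one more tail.' > "$tmp/claims"

# Quoted: the whole phrase reaches grep as one pattern argument.
quoted=$(grep 'nine tails' "$tmp/claims")

# Unquoted, the shell would split the phrase, and grep would search for
# the pattern "nine" in a file named "tails" as well as in claims:
#   grep nine tails "$tmp/claims"

echo "$quoted"
rm -r "$tmp"
```

Single quotes are the safest habit, since they also stop the shell from interpreting characters such as * and $ inside the pattern.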
Searching in Several Files at Once
An easy way to search several files is just to name them on the grep command line:
$ echo Every cat has one more tail than no cat. > general
$ echo No cat has nine tails. > specific
$ echo Therefore, every cat has ten tails. > fallacy
$ grep cat general specific fallacy
general:Every cat has one more tail than no cat.
specific:No cat has nine tails.
fallacy:Therefore, every cat has ten tails.
Perhaps we are more interested in just discovering which file mentions the word “nine” than actually seeing the line itself. Adding the -l switch to the grep line does just that:
$ grep -l nine general specific fallacy
specific
Searching Directories Recursively
Grep can also search all the files in a whole directory tree with a single command. This can be handy when working with a large number of files. The easiest way to understand this is to see it in action. On Red Hat-style systems, the directory /etc/sysconfig holds text files that contain much of the configuration information about the system. The traditional Linux name for the first Ethernet network device on a system is “eth0”, so you can find which file contains the configuration for eth0 by letting the grep -r command do the searching for you:
$ grep -r eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases:# Specify multiple ranges using multiple files, such as ifcfg-eth0-range0 and
/etc/sysconfig/network-scripts/ifup-aliases:# ifcfg-eth0-range1, etc. In these files, the following configuration variables
/etc/sysconfig/network-scripts/ifup-aliases:# The above example values create the interfaces eth0:0 through eth0:253 using
/etc/sysconfig/network-scripts/ifup-ipv6:# Example: IPV6TO4_ROUTING="eth0-:f101::0/64 eth1-:f102::0/64"
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE='eth0'
Every file in /etc/sysconfig that mentions eth0 is shown in the results. We can further limit the files listed to only those referring to an actual device by filtering the grep -r output through a grep DEVICE:
$ grep -r eth0 /etc/sysconfig 2>/dev/null | grep DEVICE
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE='eth0'
This shows a common use of grep as a filter to simplify the outputs of other commands. If only the names of the files were of interest, the output can be simplified with the -l command line switch.
$ grep -rl eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases
/etc/sysconfig/network-scripts/ifup-ipv6
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/networking/devices/ifcfg-eth0
/etc/sysconfig/networking/profiles/default/ifcfg-eth0
Inverting the Match
By default, grep shows only the lines matching the search pattern. Usually, this is what you want, but sometimes you are interested in the lines that do not match the pattern. In these instances, the -v command line switch inverts grep’s operation.
$ head -n 4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
$ grep -v root /etc/passwd | head -n 3
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
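Because every line either matches or does not, the counts from a normal search and an inverted search always add up to the number of lines in the file. The sketch below checks this on a three-line sample file, passwd.sample, which is a hypothetical stand-in for /etc/passwd.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:' \
  'daemon:x:2:2:daemon:/sbin:' > "$tmp/passwd.sample"

# Lines that mention root, and lines that do not.
with_root=$(grep -c root "$tmp/passwd.sample")
without_root=$(grep -vc root "$tmp/passwd.sample")

# An empty pattern matches every line, so this counts all lines.
total=$(grep -c '' "$tmp/passwd.sample")

echo "$with_root + $without_root = $total"
rm -r "$tmp"
```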
Getting Line Numbers
Often you may be searching a large file that has many occurrences of the pattern. Grep will list each line containing one or more matches, but how is one to locate those lines in the original file? Using the grep -n command will also list the line number of each matching line.
The file /usr/share/dict/words contains a list of common dictionary words. Identify which line contains the word “dictionary”:
$ fgrep -n dictionary /usr/share/dict/words
12526:dictionary
You might also want to combine the -n switch with the -r switch when searching all the files below a directory:
$ fgrep -nr dictionary /usr/share/dict
/usr/share/dict/linux.words:12526:dictionary
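With -n and -r together, each output line has the form filename:line number:matching text. The following sketch shows that layout on two small files, a.words and b.words, both invented for the example; grep -F is used as the equivalent of fgrep.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'alpha' 'dictionary' > "$tmp/a.words"
printf '%s\n' 'dictionary' 'omega' > "$tmp/b.words"

# Search from inside the directory so the reported paths stay short.
# grep -r visits files in directory order, so sort makes the result predictable.
hits=$(cd "$tmp" && grep -F -rn dictionary . | sort)

echo "$hits"
rm -r "$tmp"
```

Many editors can jump straight to a file and line given in this format, which makes the combination handy for navigating a large tree.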
Limiting Matching to Whole Words
Let’s have a look at our input data:
$ cat rhyme
The cat
sat on
the mat
at home.
Suppose we wanted to retrieve all lines containing the word “at”. If we try the command:
$ fgrep at rhyme
The cat
sat on
the mat
at home.
Do you see what happened? We matched the “at” string, whether it was an isolated word or part of a larger word. The grep command provides the -w switch to specify that the pattern should match only entire words.
$ grep -w at rhyme
at home.
The -w switch considers a sequence of letters, numbers, and underscore characters, surrounded by anything else, to be a word.
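That definition of a word has a few consequences worth seeing once: an underscore or a digit glued to the pattern makes it part of a longer word, while a hyphen or a space does not. A small sketch, using a made-up file named samples:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'at home' 'at_home' 'look at-home' 'at2' 'cat' > "$tmp/samples"

# Only lines where "at" has no letter, digit, or underscore on either side.
whole=$(grep -w at "$tmp/samples")

echo "$whole"
rm -r "$tmp"
```

Here the lines “at home” and “look at-home” are printed, because a space and a hyphen both count as “anything else”; “at_home”, “at2”, and “cat” are skipped.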
Ignoring Case
The string “Bob” has a meaning quite different from the string “bob”. However, sometimes we want to find either one, regardless of whether the word is capitalized or not. The grep -i command solves just this problem.
Look again at our nursery rhyme file:
$ cat rhyme
The cat
sat on
the mat
at home.
See if the file contains the word “the”, all in lowercase letters:
$ grep the rhyme
the mat
Now see which lines contain the letters “t”, “h”, and “e” in any combination of lower- or upper-case letters:
$ grep -in the rhyme
1:The cat
3:the mat
Notice that we also used the -n switch to add the line numbers to the output.
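To follow along, the rhyme file can be recreated in a scratch directory. The sketch below assumes the four-line layout implied by the line numbers in the output above, and also shows that -i combines with -w just as it does with -n.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'The cat' 'sat on' 'the mat' 'at home.' > "$tmp/rhyme"

# Case-sensitive: only the lowercase "the" on line 3 matches.
exact=$(grep -n the "$tmp/rhyme")

# -i also accepts "The" on line 1.
any_case=$(grep -in the "$tmp/rhyme")

# -i and -w together: "at" in any case, but only as a whole word.
whole_word=$(grep -iw AT "$tmp/rhyme")

echo "$exact"
echo "$any_case"
echo "$whole_word"
rm -r "$tmp"
```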
Example 1. Finding Simple Character Strings
Verify that your computer has the system account “lp”, used for the line printer tools. Hint: the file /etc/passwd contains one line for each user account on the system.
$ grep lp /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:
Example 2. In That Case
Search for the uppercase pattern “LP”, exactly as typed:
$ grep LP /etc/passwd
$
Nothing was matched because the pattern does not match the case for the account name. Search again and ignore the case:
$ grep -i LP /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:
Example 3. Matching Whole Words
We have seen that grep will match the pattern wherever the pattern is located, even in the middle of words. Search for the pattern “honey” in the system word dictionary /usr/share/dict/words:
$ grep honey /usr/share/dict/words
honey
honeybee
honeycomb
honeycombed
honeydew
honeymoon
honeymooned
honeymooner
honeymooners
honeymooning
honeymoons
honeysuckle
Mahoney
Evidently, the dictionary contains several words using the string “honey” as a root word. We can limit the matching to whole words by using the grep -w command. The grep command considers a word to be a group of letters, digits, or underscores surrounded by anything else. The beginning and end of a line also qualify as “anything else”, so the first or last word on a line is recognized correctly. Try to look up “honey” in the dictionary again:
$ grep -w honey /usr/share/dict/words
honey
For lack of a better word: perfect.
Example 4. Combining grep and xargs
Suppose that you have been placed in charge of maintaining the help file documentation for the vim editor. As you browse through the already existing files, you notice that in some places, the help files use the two words command line, and in other places the single word commandline. You would like the help files to be consistent, and decide the former is correct.
You would now like to find every occurrence of the text commandline, and change each one to command line. You start by identifying which files contain the text commandline.
$ grep -ril commandline /usr/share/doc/vim*
/usr/share/doc/vim-common-6.1/docs/message.txt
/usr/share/doc/vim-common-6.1/docs/options.txt
/usr/share/doc/vim-common-6.1/docs/os_risc.txt
/usr/share/doc/vim-common-6.1/docs/tags
/usr/share/doc/vim-common-6.1/docs/todo.txt
/usr/share/doc/vim-common-6.1/docs/various.txt
/usr/share/doc/vim-common-6.1/docs/version5.txt
You would now like to open each of these files in the gedit text editor, so that you can make your edits. You pipe the results of your search into the gedit command.
$ grep -ril commandline /usr/share/doc/vim* | gedit
The gedit editor opens, but with an empty buffer titled “untitled”. This is not what you had meant! You had wanted gedit to open the filenames that the grep command supplied on stdin, not stdin itself. Unfortunately, that’s not how gedit works. gedit (like most text editors) expects filenames to be supplied as arguments on the command line, not using stdin.
Fortunately, there is a standard Linux (and Unix) utility that helps in just such situations: xargs. The xargs command will read stdin, and append any words found there to the supplied command line, as additional arguments. Hopefully, the following example will clarify. With your knowledge of the xargs command, you modify your previous approach.
$ grep -ril commandline /usr/share/doc/vim* | xargs gedit
Now, the gedit editor opens up with multiple buffers, one for each file output by the grep command. Notice that you never had to type in the individual file names. The words supplied on stdin were exchanged for arguments on the command line, thus the name xargs. Nice.
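One caveat: because xargs splits its input on whitespace by default, a filename containing a space would arrive as two separate arguments. The sketch below shows one common way around this, assuming a grep that supports the --null option and an xargs that supports -0 (both GNU and BSD versions do, though neither option is part of POSIX). The file names are invented, and ls stands in for the editor so the example runs without a display.

```shell
tmp=$(mktemp -d)
printf 'commandline\n' > "$tmp/user guide.txt"
printf 'command line\n' > "$tmp/tags"

# --null makes grep end each filename with a NUL byte instead of a newline;
# -0 makes xargs split its input on NUL bytes only, so the space survives.
opened=$(grep -ril --null commandline "$tmp" | xargs -0 ls)

echo "$opened"
rm -r "$tmp"
```

Only “user guide.txt” is listed, as a single argument; the tags file already uses the two-word spelling and does not match.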