Using grep and regular expressions to search files and data in Linux

Writing Regular Expressions

Regular expressions provide a pattern-matching mechanism that makes it easier to find specific content. The vim, grep, and less commands can all use regular expressions, as can programming languages such as Perl, Python, and C.

Regular expressions are a language of their own, which means they have their own syntax and rules. This section looks at the syntax used to create regular expressions and shows some regular expression examples.

Describing a Simple Regular Expression

The simplest regular expression is an exact match. An exact match is when the characters in the regular expression match the type and order of the characters in the data being searched. Suppose a user is searching the following file for all occurrences of the pattern cat:

cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog

cat is an exact match of a c, followed by an a, followed by a t with no other characters in between. Using cat as the regular expression to search the previous file returns the following matches:


cat
concatenate
category
educated
vindication
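
The search above can be reproduced with grep. This is a minimal sketch that assumes the word list is saved in a temporary file; the variable names are illustrative only.

```shell
# Build the sample word list from the text in a temporary file.
words=$(mktemp)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$words"

# Print every line that contains c, a, t in that order, anywhere on the line.
matches=$(grep 'cat' "$words")
printf '%s\n' "$matches"

rm -f "$words"
```

The pattern is quoted so that the shell passes it to grep unchanged.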

Matching the Start and End of a Line

The previous section used an exact match regular expression on a file. Note that the regular expression would match the search string no matter where on the line it occurred: the beginning, end, or middle of the word or line. Use a line anchor to control where on the line the regular expression looks for a match.

To search at the beginning of a line, use the caret character (^). To search at the end of a line, use the dollar sign ($). Using the same file as above, the ^cat regular expression would match two words, cat and category. The position of the anchor matters: the $cat regular expression would not find any matching words, whereas cat$ would match only the line cat.


cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog

To locate lines in the file ending with dog, use that exact expression and an end of line anchor to create the regular expression dog$. Applying dog$ to the file would find two matches:


dog
chilidog
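
The anchored searches can be tried directly with grep. This is a sketch using the same word list in a temporary file; the variable names are assumptions made for illustration.

```shell
# Recreate the sample word list in a temporary file.
words=$(mktemp)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$words"

# ^ anchors the pattern to the start of the line.
starts_with_cat=$(grep '^cat' "$words")
# $ anchors the pattern to the end of the line.
ends_with_dog=$(grep 'dog$' "$words")
ends_with_cat=$(grep 'cat$' "$words")

printf 'begins with cat:\n%s\n' "$starts_with_cat"
printf 'ends with dog:\n%s\n' "$ends_with_dog"
printf 'ends with cat:\n%s\n' "$ends_with_cat"

rm -f "$words"
```

Single quotes keep the shell from treating the dollar sign as the start of a variable name.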

To locate the only word on a line, use both the beginning and end-of-line anchors. For example, to locate the word cat when it is the only word on a line, use ^cat$. In the following file, ^cat$ matches only the second line:


cat dog rabbit
cat
horse cat cow
cat pig
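
As a quick check of the combined anchors, the sketch below runs ^cat$ against the four lines above. The -n option, which prefixes each match with its line number, is a standard grep option not otherwise covered here; the file and variable names are illustrative.

```shell
# Recreate the four sample lines in a temporary file.
animals=$(mktemp)
printf '%s\n' 'cat dog rabbit' 'cat' 'horse cat cow' 'cat pig' > "$animals"

# Both anchors together: the whole line must be exactly "cat".
# -n shows which line matched.
only_cat=$(grep -n '^cat$' "$animals")
printf '%s\n' "$only_cat"   # prints 2:cat

rm -f "$animals"
```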

Adding Wildcards and Multipliers to Regular Expressions

Regular expressions use a period or dot (.) to match any single character with the exception of the newline character. A regular expression of c.t searches for a string that contains a c followed by any single character followed by a t. Example matches include cat, concatenate, vindication, c5t, and c$t. With an unrestricted wildcard, you cannot predict which character will match the wildcard. To match specific characters, replace the unrestricted wildcard with a bracketed list of acceptable characters. Changing the regular expression to c[aou]t matches patterns that start with a c, followed by either an a, o, or u, followed by a t.
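
The difference between the unrestricted wildcard and a bracketed list can be seen side by side. This is a sketch with a made-up sample file; the file contents and variable names are assumptions for illustration.

```shell
# Hypothetical sample data, one candidate string per line.
samples=$(mktemp)
printf '%s\n' cat cot cut c5t 'c$t' ct coat > "$samples"

# The dot accepts any single character between c and t.
any_char=$(grep 'c.t' "$samples")
# The bracket expression accepts only a, o, or u in that position.
listed_char=$(grep 'c[aou]t' "$samples")

printf 'c.t matches:\n%s\n' "$any_char"
printf 'c[aou]t matches:\n%s\n' "$listed_char"

rm -f "$samples"
```

Neither pattern matches ct or coat, because both require exactly one character between the c and the t.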

Multipliers are a mechanism often used with wildcards. A multiplier applies to the previous character or expression in the regular expression. One of the more common multipliers is the asterisk, or star character (*). When used in a regular expression, this multiplier means match zero or more of the previous expression. You can use * with expressions, not just characters, for example c[aou]*t. A regular expression of c.*t matches cat, coat, culvert, and even ct (zero characters between the c and the t): any data containing a c, then zero or more characters, then a t.
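
To see how the star behaves after a wildcard and after a bracketed list, consider the following sketch. The sample words other than those named in the text (cab, caaat) are assumptions added to show non-matching and repeated cases.

```shell
# Hypothetical sample data, one word per line.
samples=$(mktemp)
printf '%s\n' cat coat culvert ct cab caaat > "$samples"

# c, then zero or more of any character, then t.
star_any=$(grep 'c.*t' "$samples")
# c, then zero or more characters drawn only from a, o, u, then t.
star_set=$(grep 'c[aou]*t' "$samples")

printf 'c.*t matches:\n%s\n' "$star_any"
printf 'c[aou]*t matches:\n%s\n' "$star_set"

rm -f "$samples"
```

Here culvert matches c.*t but not c[aou]*t, because the letters between its c and t include characters outside the list.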

Another type of multiplier indicates how many times the previous character should occur in the pattern. An example of an explicit multiplier is 'c.\{2\}t'. This regular expression matches any string that contains a c, followed by exactly two characters of any kind, followed by a t. 'c.\{2\}t' would match two words in the example below:


cat
coat convert
cart covert
cypher
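
The two matching words can be isolated with the -o option, which prints only the matched text rather than the whole line. This is a sketch: -o is available in GNU and BSD grep but is not covered elsewhere in this section, and the variable names are illustrative.

```shell
# Recreate the four sample lines in a temporary file.
words=$(mktemp)
printf '%s\n' cat 'coat convert' 'cart covert' cypher > "$words"

# Basic syntax (grep default): the braces are written with backslashes.
basic=$(grep -o 'c.\{2\}t' "$words")
# Extended syntax (grep -E): the braces are written without backslashes.
extended=$(grep -oE 'c.{2}t' "$words")

printf '%s\n' "$basic"      # prints coat, then cart

rm -f "$words"
```

Both forms describe the same pattern; only the escaping of the braces differs.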

Regular Expressions

EXPRESSION   DESCRIPTION
.            The period (.) matches any single character.
?            The preceding item is optional and will be matched at most once.
*            The preceding item will be matched zero or more times.
+            The preceding item will be matched one or more times.
{n}          The preceding item is matched exactly n times.
{n,}         The preceding item is matched n or more times.
{,m}         The preceding item is matched at most m times.
{n,m}        The preceding item is matched at least n times, but not more than m times.
[:alnum:]    Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[0-9A-Za-z]’.
[:alpha:]    Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[A-Za-z]’.
[:blank:]    Blank characters: space and tab.
[:cntrl:]    Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
[:digit:]    Digits: 0 1 2 3 4 5 6 7 8 9.
[:graph:]    Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.
[:lower:]    Lower-case letters; in the ‘C’ locale and ASCII character encoding, this is a b c d e f g h i j k l m n o p q r s t u v w x y z.
[:print:]    Printable characters: ‘[:alnum:]’, ‘[:punct:]’, and space.
[:punct:]    Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:space:]    Space characters: in the ‘C’ locale, this is tab, newline, vertical tab, form feed, carriage return, and space.
[:upper:]    Upper-case letters: in the ‘C’ locale and ASCII character encoding, this is A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.
[:xdigit:]   Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
\b           Match the empty string at the edge of a word.
\B           Match the empty string provided it is not at the edge of a word.
\<           Match the empty string at the beginning of word.
\>           Match the empty string at the end of word.
\w           Match word constituent. Synonym for ‘[_[:alnum:]]’.
\W           Match non-word constituent. Synonym for ‘[^_[:alnum:]]’.
\s           Match white space. Synonym for ‘[[:space:]]’.
\S           Match non-whitespace. Synonym for ‘[^[:space:]]’.
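
A few of the table entries are shown in use below. This is a sketch with made-up sample data. It assumes GNU grep: the ?, +, and brace multipliers are written as shown only in extended syntax (grep -E), a character class such as [:digit:] must itself sit inside a bracket expression, and \< and \> are GNU extensions. The -o option prints only the matched text.

```shell
# Hypothetical sample data.
data=$(mktemp)
printf '%s\n' 'color colour' 'id 42' 'id 12345' 'concatenate cat' > "$data"

# ? makes the preceding u optional.
optional=$(grep -oE 'colou?r' "$data")
# [[:digit:]]{2,3} accepts two or three digits; the anchors reject longer runs.
short_id=$(grep -E '^id [[:digit:]]{2,3}$' "$data")
# \< and \> restrict the match to a whole word, so concatenate is skipped.
whole_word=$(grep -o '\<cat\>' "$data")

printf '%s\n' "$optional" "$short_id" "$whole_word"

rm -f "$data"
```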

Matching Regular Expressions with grep

The grep command, provided as part of the distribution, uses regular expressions to isolate matching data.

Isolating data using the grep command

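The grep command takes a regular expression and one or more files in which to search for matching lines. Enclosing the regular expression in single quotes prevents the shell from interpreting any special characters it contains.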

[user@host ~]$ grep '^computer' /usr/share/dict/words
computer
computerese
computerise
computerite
computerizable
computerization
computerize
computerized
computerizes
computerizing
computerlike
computernik
computers

The grep command can be used in conjunction with other commands using a pipe operator (|). For example:


[root@host ~]# ps aux | grep chrony
chrony     662  0.0  0.1  29440  2468 ?        S    10:56   0:00 /usr/sbin/chronyd
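
The same filtering works on the output of any command. In the sketch below, printf stands in for a command that lists processes so that the result is predictable; the process lines are invented for illustration.

```shell
# printf simulates the output of a process listing (hypothetical lines).
chrony_lines=$(printf '%s\n' \
    'root     1   /usr/lib/systemd/systemd' \
    'chrony   662 /usr/sbin/chronyd' \
    'root     701 /usr/sbin/sshd' | grep 'chrony')

# Only the line containing the pattern passes through the pipe.
printf '%s\n' "$chrony_lines"
```

When grep reads from a pipe, no file name is given and it searches its standard input instead.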

grep Options

The grep command has many useful options for adjusting how it uses the provided regular expression with data.

OPTION       FUNCTION
-i           Use the regular expression provided but do not enforce case sensitivity (run case-insensitive).
-v           Only display lines that do not contain matches to the regular expression.
-r           Apply the search for data matching the regular expression recursively to a group of files or directories.
-A NUMBER    Display NUMBER of lines after the regular expression match.
-B NUMBER    Display NUMBER of lines before the regular expression match.
-e           With multiple -e options used, multiple regular expressions can be supplied and will be used with a logical OR.
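
The context and recursive options can be tried on a small throwaway directory. This is a sketch: the directory layout, file names, and log lines are all invented for illustration.

```shell
# Build a hypothetical directory tree with two small log files.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
printf '%s\n' 'line one' 'line two' 'ERROR disk full' 'line four' 'line five' \
    > "$workdir/app.log"
printf '%s\n' 'all good' 'error: retry' > "$workdir/sub/other.log"

# -B 1 and -A 1 add one line of context before and after the match.
context=$(grep -B 1 -A 1 'ERROR' "$workdir/app.log")
printf '%s\n' "$context"

# -r descends into subdirectories; -i ignores case.
# sort only makes the order of the results predictable.
recursive=$(cd "$workdir" && grep -r -i 'error' . | sort)
printf '%s\n' "$recursive"

rm -rf "$workdir"
```

When more than one file is searched, grep prefixes each matching line with the name of the file it came from.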

There are many other options to grep. Use the man page to research them.

[user@host ~]$ man grep

grep Examples

The following examples use various configuration files and log files. Regular expressions are case-sensitive by default. Use the -i option with grep to run a case-insensitive search. The following example searches for the pattern serverroot.

[user@host ~]$ cat /etc/httpd/conf/httpd.conf
...output omitted...
ServerRoot "/etc/httpd"

#
# Listen: Allows you to bind Apache to specific IP addresses and/or
# ports, instead of the default. See also the <VirtualHost>
# directive.
#
# Change this to Listen on specific IP addresses as shown below to
# prevent Apache from glomming onto all bound IP addresses.
#
#Listen 12.34.56.78:80
Listen 80

[user@host ~]$ grep -i serverroot /etc/httpd/conf/httpd.conf
# with "/", the value of ServerRoot is prepended -- so 'log/access_log'
# with ServerRoot set to '/www' will be interpreted by the
# ServerRoot: The top of the directory tree under which the server's
# ServerRoot at a non-local disk, be sure to specify a local disk on the
# same ServerRoot for multiple httpd daemons, you will need to change at
ServerRoot "/etc/httpd"

In cases where you know what you are not looking for, the -v option is very useful. The -v option only displays lines that do not match the regular expression. In the following example, all lines, regardless of case, that do not contain the regular expression server are returned.

[user@host ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.25.254.254  classroom.example.com classroom
172.25.254.254  content.example.com content
172.25.254.254  materials.example.com materials
172.25.250.254  workstation.lab.example.com workstation
### rht-vm-hosts file listing the entries to be appended to /etc/hosts

172.25.250.10   servera.lab.example.com servera
172.25.250.11   serverb.lab.example.com serverb
172.25.250.254  workstation.lab.example.com workstation
[user@host ~]$ grep -v -i server /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.25.254.254  classroom.example.com classroom
172.25.254.254  content.example.com content
172.25.254.254  materials.example.com materials
172.25.250.254  workstation.lab.example.com workstation
### rht-vm-hosts file listing the entries to be appended to /etc/hosts

172.25.250.254  workstation.lab.example.com workstation

To look at a file without being distracted by comment lines, use the -v option. In the following example, the regular expression matches all lines that begin with a # or ; (typical characters that indicate the line will be interpreted as a comment). Those lines are then omitted from the output.

[user@host ~]$ cat /etc/ethertypes
#
# Ethernet frame types
#       This file describes some of the various Ethernet
#       protocol types that are used on Ethernet networks.
#
# This list could be found on:
#         http://www.iana.org/assignments/ethernet-numbers
#         http://www.iana.org/assignments/ieee-802-numbers
#
# <name>    <hexnumber> <alias1>...<alias35> #Comment
#
IPv4        0800    ip ip4      # Internet IP (IPv4)
X25     0805
ARP     0806    ether-arp   #
FR_ARP      0808            # Frame Relay ARP        [RFC1701]
[user@host ~]$ grep -v '^[#;]' /etc/ethertypes
IPv4        0800    ip ip4      # Internet IP (IPv4)
X25     0805
ARP     0806    ether-arp   #
FR_ARP      0808            # Frame Relay ARP        [RFC1701]
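
Configuration files often contain blank lines as well as comments. A possible extension, sketched below on an invented configuration file, combines -v with two -e patterns so that lines matching either pattern are dropped.

```shell
# Hypothetical configuration file with comments and a blank line.
conf=$(mktemp)
printf '%s\n' '# main settings' '' 'port 8080' '; legacy comment' \
    'host example.com   # inline' > "$conf"

# Drop lines that start with # or ; and lines that are empty (^$).
active=$(grep -v -e '^[#;]' -e '^$' "$conf")
printf '%s\n' "$active"

rm -f "$conf"
```

A comment that follows a setting on the same line is kept, because the pattern is anchored to the start of the line.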

The grep command with the -e option allows you to search for more than one regular expression at a time. The following example, using a combination of less and grep, locates all occurrences of pam_unix, user root and Accepted publickey in the /var/log/secure log file.

[root@host ~]# grep -e 'pam_unix' -e 'user root' -e 'Accepted publickey' /var/log/secure | less
Mar 19 08:04:46 jegui sshd[6141]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 19 08:04:50 jegui sshd[6144]: Disconnected from user root 172.25.250.254 port 41170
Mar 19 08:04:50 jegui sshd[6141]: pam_unix(sshd:session): session closed for user root
Mar 19 08:04:53 jegui sshd[6168]: Accepted publickey for student from 172.25.250.254 port 41172 ssh2: RSA SHA256:M8ikhcEDm2tQ95Z0o7ZvufqEixCFCt+wowZLNzNlBT0
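
Several -e patterns behave like a single extended regular expression whose alternatives are separated by the | operator. The sketch below compares the two forms on a few shortened, invented log lines; the -E option selects extended syntax.

```shell
# Hypothetical, shortened log lines.
log=$(mktemp)
printf '%s\n' \
    'sshd[6141]: pam_unix(sshd:session): session opened for user root' \
    'sshd[6144]: Disconnected from user root' \
    'sshd[6168]: Accepted publickey for student' \
    'systemd[1]: Started Session 5 of user student' > "$log"

# Three separate patterns, combined with a logical OR.
with_e=$(grep -e 'pam_unix' -e 'user root' -e 'Accepted publickey' "$log")
# One extended pattern with the same three alternatives.
with_alt=$(grep -E 'pam_unix|user root|Accepted publickey' "$log")

printf '%s\n' "$with_e"

rm -f "$log"
```

Both commands select the first three lines. The -e form is often easier to read when the individual patterns are long.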

To search for text in a file opened using vim or less, use the slash character (/) and type the pattern to find. Press Enter to start the search. Press n to find the next match, or N to find the previous match.

[root@host ~]# vim /var/log/boot.log
...output omitted...
[^[[0;32m  OK  ^[[0m] Reached target Initrd Default Target.^M
Starting dracut pre-pivot and cleanup hook...^M
[^[[0;32m  OK  ^[[0m] Started dracut pre-pivot and cleanup hook.^M
Starting Cleaning Up and Shutting Down Daemons...^M
Starting Plymouth switch root service...^M
Starting Setup Virtual Console...^M
[^[[0;32m  OK  ^[[0m] Stopped target Timers.^M
[^[[0;32m  OK  ^[[0m] Stopped dracut pre-pivot and cleanup hook.^M
[^[[0;32m  OK  ^[[0m] Stopped target Initrd Default Target.^M
[root@host ~]# less /var/log/messages
...output omitted...
Feb 26 15:51:07 jegui NetworkManager[689]:   [1551214267.8584] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.14.0-14.el8/libnm-deviceplugin-team.so)
Feb 26 15:51:07 jegui NetworkManager[689]:   [1551214267.8599] device (lo): carrier: link connected
Feb 26 15:51:07 jegui NetworkManager[689]:   [1551214267.8600] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Feb 26 15:51:07 jegui NetworkManager[689]:   [1551214267.8623] manager: (ens3): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Feb 26 15:51:07 jegui NetworkManager[689]:   [1551214267.8653] device (ens3): state change: unmanaged -> unavailable (reason 'managed', sys-ifacestate: 'external')
/device