Using grep and regular expressions to search files and data in Linux
Writing Regular Expressions
Regular expressions provide a pattern matching mechanism that facilitates finding specific content. The vim, grep, and less commands can all use regular expressions. Programming languages such as Perl, Python, and C also support regular expressions for pattern matching.
Regular expressions are a language of their own, which means they have their own syntax and rules. This section looks at the syntax used when creating regular expressions and shows some regular expression examples.
Describing a Simple Regular Expression
The simplest regular expression is an exact match. An exact match is when the characters in the regular expression match the type and order in the data that is being searched. Suppose a user is looking through the following file looking for all occurrences of the pattern cat:
cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog
cat is an exact match of a c, followed by an a, followed by a t with no other characters in between. Using cat as the regular expression to search the previous file returns the following matches:
cat
concatenate
category
educated
vindication
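The same search can be reproduced with grep. The following is a minimal sketch: the scratch directory and the file name words.txt are assumptions made for illustration, not part of the original example.

```shell
# Recreate the sample word list in a scratch file (hypothetical name).
workdir=$(mktemp -d)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$workdir/words.txt"

# Print every line that contains the exact sequence c, a, t.
exact_matches=$(grep 'cat' "$workdir/words.txt")
printf '%s\n' "$exact_matches"

rm -r "$workdir"
```

Because the pattern is not anchored, words such as educated and vindication are reported even though cat appears only in the middle of them.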
Matching the Start and End of a Line
The previous section used an exact match regular expression on a file. Note that the regular expression would match the search string no matter where on the line it occurred: the beginning, end, or middle of the word or line. Use a line anchor to control the location of where the regular expression looks for a match.
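To search at the beginning of a line, use the caret character (^). To search at the end of a line, use the dollar sign ($). Using the same file as above, the ^cat regular expression would match two words, cat and category. An anchor works only in its proper position: the $cat regular expression, with the dollar sign misplaced at the front, would not find any matching words, whereas cat$ would match only the line cat.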
cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog
To locate lines in the file ending with dog, use that exact expression and an end of line anchor to create the regular expression dog$. Applying dog$ to the file would find two matches:
dog
chilidog
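As a rough illustration, the two anchors can be tried against the same word list. The scratch file words.txt is a hypothetical name used only for this sketch.

```shell
# Recreate the sample word list in a scratch file (hypothetical name).
workdir=$(mktemp -d)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$workdir/words.txt"

# ^ anchors the pattern to the start of the line.
starts_with_cat=$(grep '^cat' "$workdir/words.txt")
# $ anchors the pattern to the end of the line.
ends_with_dog=$(grep 'dog$' "$workdir/words.txt")

printf 'Lines starting with cat:\n%s\n' "$starts_with_cat"
printf 'Lines ending with dog:\n%s\n' "$ends_with_dog"

rm -r "$workdir"
```

The single quotes matter here: they stop the shell from treating $ as the start of a variable name before grep sees the pattern.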
To locate the only word on a line, use both the beginning and end-of-line anchors. For example, to locate the word cat when it is the only word on a line, use ^cat$. In the following sample, only the second line matches:
cat dog rabbit
cat
horse cat cow
cat pig
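A short sketch of the difference between the anchored and unanchored pattern on these four lines follows. The file name pets.txt is an assumption; -n and -c are standard grep options for line numbers and match counts.

```shell
# Recreate the four sample lines in a scratch file (hypothetical name).
workdir=$(mktemp -d)
printf '%s\n' 'cat dog rabbit' 'cat' 'horse cat cow' 'cat pig' > "$workdir/pets.txt"

# -n prefixes each match with its line number; only line 2 is exactly "cat".
only_cat=$(grep -n '^cat$' "$workdir/pets.txt")
printf '%s\n' "$only_cat"

# -c counts matching lines; the unanchored pattern matches all four.
unanchored_count=$(grep -c 'cat' "$workdir/pets.txt")
printf '%s\n' "$unanchored_count"

rm -r "$workdir"
```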
Adding Wildcards and Multipliers to Regular Expressions
Regular expressions use a period or dot (.) to match any single character with the exception of the newline character. A regular expression of c.t searches for a string that contains a c followed by any single character followed by a t. Example matches include cat, concatenate, vindication, c5t, and c$t. With an unrestricted wildcard, you cannot predict which character will match. To match only specific characters, replace the wildcard with a bracketed set of acceptable characters. Changing the regular expression to c[aou]t matches patterns that start with a c, followed by either an a, o, or u, followed by a t.
Multipliers are often used with wildcards. A multiplier applies to the previous character or bracketed set in the regular expression. One of the most common multipliers is the asterisk, or star character (*), which means match zero or more of the preceding item. You can use * with expressions, not just characters; for example, c[aou]*t matches a c, followed by zero or more of the characters a, o, or u, followed by a t. A regular expression of c.*t matches cat, coat, culvert, and even ct (zero characters between the c and the t): any data containing a c, then zero or more characters, then a t.
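The three patterns discussed so far can be compared side by side. This is only a sketch: the word list and the file name samples.txt are made up for illustration, and the -x option (match the whole line) is added so that the differences between the patterns are easy to see.

```shell
# Made-up word list for comparing the patterns (hypothetical file name).
workdir=$(mktemp -d)
printf '%s\n' cat cot cut c5t ct coat culvert cab > "$workdir/samples.txt"

# c.t: exactly one character of any kind between c and t.
dot_matches=$(grep -x 'c.t' "$workdir/samples.txt")
# c[aou]t: the middle character must be a, o, or u.
set_matches=$(grep -x 'c[aou]t' "$workdir/samples.txt")
# c.*t: zero or more characters of any kind between c and t.
star_matches=$(grep -x 'c.*t' "$workdir/samples.txt")

printf 'c.t:\n%s\n' "$dot_matches"
printf 'c[aou]t:\n%s\n' "$set_matches"
printf 'c.*t:\n%s\n' "$star_matches"

rm -r "$workdir"
```

Without -x, each pattern would also match longer lines that merely contain a matching substring.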
Another type of multiplier specifies how many times the preceding item must occur in the pattern. An example of an explicit multiplier is 'c.\{2\}t'. This regular expression matches a c, followed by exactly two characters of any kind, followed by a t. 'c.\{2\}t' would match two words, coat and cart, in the example below:
cat
coat convert
cart covert
cypher
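The explicit multiplier can be checked with grep. This sketch assumes a grep that supports the -o option (GNU and BSD grep do), which prints only the matched text rather than the whole line; the file name interval.txt is hypothetical.

```shell
# Recreate the four sample lines in a scratch file (hypothetical name).
workdir=$(mktemp -d)
printf '%s\n' 'cat' 'coat convert' 'cart covert' 'cypher' > "$workdir/interval.txt"

# Basic regular expressions (grep's default) need backslashes on the braces.
bre_matches=$(grep -o 'c.\{2\}t' "$workdir/interval.txt")
# With -E (extended regular expressions) the braces are written unescaped.
ere_matches=$(grep -E -o 'c.{2}t' "$workdir/interval.txt")

printf '%s\n' "$bre_matches"
printf '%s\n' "$ere_matches"

rm -r "$workdir"
```

Both commands report the same two words; only the notation for the braces differs between the basic and extended syntax.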
Regular Expressions
OPTION | DESCRIPTION |
---|---|
. | The period (.) matches any single character. |
? | The preceding item is optional and will be matched at most once. |
* | The preceding item will be matched zero or more times. |
+ | The preceding item will be matched one or more times. |
{n} | The preceding item is matched exactly n times. |
{n,} | The preceding item is matched n or more times. |
{,m} | The preceding item is matched at most m times. |
{n,m} | The preceding item is matched at least n times, but not more than m times. |
[:alnum:] | Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[0-9A-Za-z]’. |
[:alpha:] | Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[A-Za-z]’. |
[:blank:] | Blank characters: space and tab. |
[:cntrl:] | Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any. |
[:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9. |
[:graph:] | Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’. |
[:lower:] | Lower-case letters; in the ‘C’ locale and ASCII character encoding, this is a b c d e f g h i j k l m n o p q r s t u v w x y z. |
[:print:] | Printable characters: ‘[:alnum:]’, ‘[:punct:]’, and space. |
[:punct:] | Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~. |
[:space:] | Space characters: in the ‘C’ locale, this is tab, newline, vertical tab, form feed, carriage return, and space. |
[:upper:] | Upper-case letters: in the ‘C’ locale and ASCII character encoding, this is A B C D E F G H I J K L M N O P Q R S T U V W X Y Z. |
[:xdigit:] | Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f. |
\b | Match the empty string at the edge of a word. |
\B | Match the empty string provided it is not at the edge of a word. |
\< | Match the empty string at the beginning of word. |
\> | Match the empty string at the end of word. |
\w | Match word constituent. Synonym for ‘[_[:alnum:]]’. |
\W | Match non-word constituent. Synonym for ‘[^_[:alnum:]]’. |
\s | Match white space. Synonym for ‘[[:space:]]’. |
\S | Match non-whitespace. Synonym for ‘[^[:space:]]’. |
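A few entries from the table can be combined in a short sketch. Two points are worth noting: a class name such as [:digit:] is only valid inside a bracket expression, so it is written [[:digit:]], and the + and unescaped {n,m} multipliers require extended regular expressions (grep -E). The sample lines and the file name classes.txt are invented for illustration, and the \< and \> anchors are assumed to be available as they are in GNU grep.

```shell
# Invented sample lines (hypothetical file name).
workdir=$(mktemp -d)
printf '%s\n' 'port 80' 'port 8080' 'export PATH' 'Port: 443' 'no digits here' \
    > "$workdir/classes.txt"

# One or more digits anywhere on the line; + needs -E.
digit_lines=$(grep -E '[[:digit:]]+' "$workdir/classes.txt")
# A space followed by three or four digits at the end of the line.
three_or_four=$(grep -E ' [[:digit:]]{3,4}$' "$workdir/classes.txt")
# \< and \> keep "port" from matching inside "export".
whole_word=$(grep '\<port\>' "$workdir/classes.txt")

printf 'digits:\n%s\n' "$digit_lines"
printf 'three or four digits:\n%s\n' "$three_or_four"
printf 'whole word port:\n%s\n' "$whole_word"

rm -r "$workdir"
```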
Matching Regular Expressions with grep
The grep command, provided as part of the distribution, uses regular expressions to isolate matching data.
Isolating data using the grep command
The grep command takes a regular expression and one or more files in which to search for matches.
$ grep '^computer' /usr/share/dict/words
computer
computerese
computerise
computerite
computerizable
computerization
computerize
computerized
computerizes
computerizing
computerlike
computernik
computers
The grep command can be used in conjunction with other commands using a pipe operator (|). For example:
# ps aux | grep chrony
chrony 662 0.0 0.1 29440 2468 ? S 10:56 0:00 /usr/sbin/chronyd
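When filtering a process listing this way, the grep process itself may also appear in the results, because its own command line contains the search text. A common workaround is to wrap one character of the pattern in brackets. The sketch below uses made-up process listings in place of real ps output so that the result is repeatable; the user names and process IDs are assumptions.

```shell
# Made-up process listing that stands in for real `ps aux` output.
sample_ps='chrony  662 /usr/sbin/chronyd
student 900 grep chrony
root    801 /usr/sbin/sshd'

# A plain pattern also matches the line for the grep process itself.
plain=$(printf '%s\n' "$sample_ps" | grep 'chrony')

# [c]hrony still matches the text "chrony", but a grep process started with
# this pattern shows "[c]hrony" on its command line, which does not match.
sample_ps_bracket='chrony  662 /usr/sbin/chronyd
student 900 grep [c]hrony
root    801 /usr/sbin/sshd'
bracket=$(printf '%s\n' "$sample_ps_bracket" | grep '[c]hrony')

printf 'plain pattern:\n%s\n' "$plain"
printf 'bracket pattern:\n%s\n' "$bracket"
```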
grep Options
The grep command has many useful options for adjusting how it uses the provided regular expression with data.
OPTION | FUNCTION |
---|---|
-i | Use the regular expression provided but do not enforce case sensitivity (run case-insensitive). |
-v | Only display lines that do not contain matches to the regular expression. |
-r | Apply the search for data matching the regular expression recursively to a group of files or directories. |
-A NUMBER | Display NUMBER of lines after the regular expression match. |
-B NUMBER | Display NUMBER of lines before the regular expression match. |
-e | With multiple -e options used, multiple regular expressions can be supplied and will be used with a logical OR. |
There are many other options to grep. Use the man page to research them.
# man grep
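The context and recursive options from the table can be sketched as follows. The directory layout, file names, and file contents are all invented for illustration; -l (list only the names of matching files) is a standard grep option that is not in the table above.

```shell
# Invented scratch files (hypothetical names and contents).
workdir=$(mktemp -d)
mkdir "$workdir/conf.d"
printf '%s\n' 'line one' 'line two' 'ERROR disk full' 'line four' 'line five' \
    > "$workdir/app.log"
printf '%s\n' 'Listen 80' > "$workdir/conf.d/site.conf"

# -B 1 and -A 1 add one line of context before and after each match.
context=$(grep -B 1 -A 1 'ERROR' "$workdir/app.log")
# -r searches every file under the directory; -l lists only the file names.
found_in=$(cd "$workdir" && grep -r -l 'Listen' .)

printf '%s\n' "$context"
printf '%s\n' "$found_in"

rm -r "$workdir"
```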
grep Examples
The next examples use varied configuration files and log files. Regular expressions are case-sensitive by default. Use the -i option with grep to run a case-insensitive search. The following example searches for the pattern serverroot.
$ cat /etc/httpd/conf/httpd.conf
...output omitted...
ServerRoot "/etc/httpd"
#
# Listen: Allows you to bind Apache to specific IP addresses and/or
# ports, instead of the default. See also the <VirtualHost>
# directive.
#
# Change this to Listen on specific IP addresses as shown below to
# prevent Apache from glomming onto all bound IP addresses.
#
#Listen 12.34.56.78:80
Listen 80
$ grep -i serverroot /etc/httpd/conf/httpd.conf
# with "/", the value of ServerRoot is prepended -- so 'log/access_log'
# with ServerRoot set to '/www' will be interpreted by the
# ServerRoot: The top of the directory tree under which the server's
# ServerRoot at a non-local disk, be sure to specify a local disk on the
# same ServerRoot for multiple httpd daemons, you will need to change at
ServerRoot "/etc/httpd"
In cases where you know what you are not looking for, the -v option is very useful. The -v option only displays lines that do not match the regular expression. In the following example, all lines, regardless of case, that do not contain the regular expression server are returned.
$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.25.254.254 classroom.example.com classroom
172.25.254.254 content.example.com content
172.25.254.254 materials.example.com materials
172.25.250.254 workstation.lab.example.com workstation
### rht-vm-hosts file listing the entries to be appended to /etc/hosts
172.25.250.10 servera.lab.example.com servera
172.25.250.11 serverb.lab.example.com serverb
172.25.250.254 workstation.lab.example.com workstation
$ grep -v -i server /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.25.254.254 classroom.example.com classroom
172.25.254.254 content.example.com content
172.25.254.254 materials.example.com materials
172.25.250.254 workstation.lab.example.com workstation
### rht-vm-hosts file listing the entries to be appended to /etc/hosts
172.25.250.254 workstation.lab.example.com workstation
To look at a file without being distracted by comment lines, use the -v option. In the following example, the regular expression matches all lines that begin with a # or ; (typical characters that indicate the line will be interpreted as a comment). Those lines are then omitted from the output.
$ cat /etc/ethertypes
#
# Ethernet frame types
# This file describes some of the various Ethernet
# protocol types that are used on Ethernet networks.
#
# This list could be found on:
# http://www.iana.org/assignments/ethernet-numbers
# http://www.iana.org/assignments/ieee-802-numbers
#
# ... #Comment
#
IPv4 0800 ip ip4 # Internet IP (IPv4)
X25 0805
ARP 0806 ether-arp #
FR_ARP 0808 # Frame Relay ARP [RFC1701]
$ grep -v '^[#;]' /etc/ethertypes
IPv4 0800 ip ip4 # Internet IP (IPv4)
X25 0805
ARP 0806 ether-arp #
FR_ARP 0808 # Frame Relay ARP [RFC1701]
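A common extension of this idea is to drop blank lines as well as comments. The sketch below is one possible way to do that, using a second -e pattern; the file name sample.conf and its contents are invented for illustration.

```shell
# Invented configuration file (hypothetical name and contents).
workdir=$(mktemp -d)
printf '%s\n' '# main settings' '' 'Listen 80' '; legacy note' '   ' 'ServerName demo' \
    > "$workdir/sample.conf"

# Each -e adds a pattern; -v drops lines that match any of them:
#   ^[#;]           comment lines
#   ^[[:space:]]*$  empty or whitespace-only lines
active=$(grep -v -e '^[#;]' -e '^[[:space:]]*$' "$workdir/sample.conf")
printf '%s\n' "$active"

rm -r "$workdir"
```

Note that ^[#;] only removes lines whose first character is a comment marker; a trailing comment on a data line is left in place.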
The grep command with the -e option allows you to search for more than one regular expression at a time. The following example, using a combination of less and grep, locates all occurrences of pam_unix, user root, and Accepted publickey in the /var/log/secure log file.
# grep -e 'pam_unix' -e 'user root' -e 'Accepted publickey' /var/log/secure | less
Mar 19 08:04:46 jegui sshd[6141]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 19 08:04:50 jegui sshd[6144]: Disconnected from user root 172.25.250.254 port 41170
Mar 19 08:04:50 jegui sshd[6141]: pam_unix(sshd:session): session closed for user root
Mar 19 08:04:53 jegui sshd[6168]: Accepted publickey for student from 172.25.250.254 port 41172 ssh2: RSA SHA256:M8ikhcEDm2tQ95Z0o7ZvufqEixCFCt+wowZLNzNlBT0
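Several -e patterns can also be written as one extended regular expression, using the alternation operator (|) together with grep -E. The sketch below compares the two forms on a few invented log lines; the file name secure.sample and its contents are assumptions and are not taken from a real /var/log/secure.

```shell
# Invented log lines (hypothetical file name and contents).
workdir=$(mktemp -d)
printf '%s\n' \
    'sshd[6141]: pam_unix(sshd:session): session opened for user root' \
    'sshd[6168]: Accepted publickey for student' \
    'systemd[1]: Started Session 4 of user student' \
    > "$workdir/secure.sample"

# Two -e options: a line is printed if it matches either pattern.
multi_e=$(grep -e 'pam_unix' -e 'Accepted publickey' "$workdir/secure.sample")
# The same logical OR written as a single extended regular expression.
alternation=$(grep -E 'pam_unix|Accepted publickey' "$workdir/secure.sample")

printf '%s\n' "$multi_e"
printf '%s\n' "$alternation"

rm -r "$workdir"
```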
To search for text in a file opened using vim or less, use the slash character (/) and type the pattern to find. Press Enter to start the search. Press n to find the next match, or N (Shift+n) to find the previous one.
# vim /var/log/boot.log
...output omitted...
[^[[0;32m OK ^[[0m] Reached target Initrd Default Target.^M
Starting dracut pre-pivot and cleanup hook...^M
[^[[0;32m OK ^[[0m] Started dracut pre-pivot and cleanup hook.^M
Starting Cleaning Up and Shutting Down Daemons...^M
Starting Plymouth switch root service...^M
Starting Setup Virtual Console...^M
[^[[0;32m OK ^[[0m] Stopped target Timers.^M
[^[[0;32m OK ^[[0m] Stopped dracut pre-pivot and cleanup hook.^M
[^[[0;32m OK ^[[0m] Stopped target Initrd Default Target.^M
# less /var/log/messages
...output omitted...
Feb 26 15:51:07 jegui NetworkManager[689]: [1551214267.8584] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.14.0-14.el8/libnm-deviceplugin-team.so)
Feb 26 15:51:07 jegui NetworkManager[689]: [1551214267.8599] device (lo): carrier: link connected
Feb 26 15:51:07 jegui NetworkManager[689]: [1551214267.8600] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Feb 26 15:51:07 jegui NetworkManager[689]: [1551214267.8623] manager: (ens3): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Feb 26 15:51:07 jegui NetworkManager[689]: [1551214267.8653] device (ens3): state change: unmanaged -> unavailable (reason 'managed', sys-ifacestate: 'external')