Beginner's Guide to Using "awk" in Shell Scripts

Introduction to the awk Programming Language

The awk programming language grew out of the recognition that many data processing problems are specialized applications of the concept of filtering, where the data is structured into records to which transformations are repetitively applied. The awk programming language is a record-oriented language that is named for its authors Aho, Weinberger, and Kernighan of AT&T Bell Labs.

The awk command runs programs written in the awk programming language. This module describes the awk programming language. Unlike sed, awk looks at data by records and fields. By default, records are delimited by newline characters, and the fields within them are delimited by spaces or tabs, but you can set these to the delimiters that are built into your data, such as colons or commas.

Applications written in the awk programming language provide the following capabilities:

  • Filtering
  • Numerical processing on rows and columns of data
  • Text processing to perform repetitive editing tasks
  • Report generation

Using the awk programming language, you can develop programs within a script. Many programming concepts, such as conditionals, looping, variables, and functions, are included in the awk programming language. This module focuses on the basic concepts of awk, not including those programming concepts.

Format of the awk Command

The awk command has the following format:

awk 'statement' input.file

The statement is enclosed in single quotes and takes one of three forms:

  • pattern { ACTION } - The action is taken on those records that match the pattern.
  • pattern - All records that match the pattern are printed.
  • { ACTION } - The action is taken on all records in the input file.
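
The three forms can be compared side by side. The following is a minimal sketch, not part of the original module: the three-record sample and the function names (sample, pattern_and_action, pattern_only, action_only) are hypothetical and exist only for illustration.

```shell
# Hypothetical three-record sample: a region name, then a score.
sample() {
    printf '%s\n' 'eastern 4.4' 'western 5.3' 'southeast 5.0'
}

# pattern { ACTION }: print field 2 only for records that match /east/.
pattern_and_action() { sample | awk '/east/ { print $2 }'; }

# pattern alone: every matching record is printed in full.
pattern_only() { sample | awk '/east/'; }

# { ACTION } alone: the action runs on every record.
action_only() { sample | awk '{ print $1 }'; }

pattern_and_action
pattern_only
action_only
```

The western record appears only in the output of the last form, because that form has no pattern to filter it out.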

Executing awk Scripts

Execute awk scripts using this format:

$ awk -f scriptfile input.file

You can combine repetitive awk commands into awk scripts that consist of one or more lines of the form:

pattern { ACTION }

where the pattern is commonly a regular expression (RE) enclosed in slashes (/RE/) and ACTION is one or more statements of the awk language.

Using awk to Print Selected Fields

The print statement outputs data from the file. When awk reads a record, it divides the record into fields based on the FS (input field separator) variable. This variable is predefined in awk to be one or more spaces or tabs. The variables $1, $2, $3 hold the values of the first, second, and third fields. The variable $0 holds the value of the entire line.

In the following example, Field 3 (first name), Field 4 (last name), and Field 2 (office) are printed, in that order.

$ cat data.file
northwest       NW      Joel Craig      3.0 .98 3       4
western         WE      Sharon Kelly    5.3 .97 5       23
southwest       SW      Chris Foster    2.7 .8  2       18
southern        SO      May Chin        5.1 .95 4       15
southeast       SE      Derek Johnson   5.0 .70 4       17
eastern         EA      Susan Beal      4.4 .8  5       20
northeast       NE      TJ Nichols      5.1 .94 3       13
north           NO      Val Shultz      4.5 .89 5       9
central         CT      Sheri Watson    5.7 .94 5       13
$ awk '{ print $3, $4, $2 }' data.file
Joel Craig NW
Sharon Kelly WE
Chris Foster SW
May Chin SO
Derek Johnson SE
Susan Beal EA
TJ Nichols NE
Val Shultz NO
Sheri Watson CT
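
To see how the numbered field variables differ from $0, the sketch below feeds awk a single record copied from data.file. The variable name record and the function names are hypothetical, chosen only for this illustration.

```shell
# One record in the same layout as data.file.
record='western         WE      Sharon Kelly    5.3 .97 5       23'

# $1 and $2 hold the first two fields; a run of blanks counts as one separator.
first_two() { printf '%s\n' "$record" | awk '{ print $1, $2 }'; }

# $0 holds the entire record, so it is printed unchanged.
whole_record() { printf '%s\n' "$record" | awk '{ print $0 }'; }

first_two
whole_record
```

The first command prints western WE with a single space, while the second reproduces the original spacing exactly.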

Formatting With print Statements

By adding tabs or other text inside double quotes, you can format the output neatly. In awk, anything in double-quotes is a string constant. You can use string constants in many places in awk—in print statements and as values to be assigned to variables among other things. Some special formatting characters use letters, such as \t for tab. You can also specify an octal value, such as \011 for tab. Some of the formats you can use are shown in the table below.

Characters   Meaning
\t           Tab
\n           Newline
\007         Bell
\011         Tab
\012         Newline
\042         "
\044         $
\045         %
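
A few of these escapes can be tried without any input file by placing the print statement in a BEGIN action. This is a small sketch for experimentation; the function names and the strings being printed are made up for the example.

```shell
# \t inserts a tab between the two words.
tab_demo() { awk 'BEGIN { print "name\toffice" }'; }

# \042 is the octal code for a double quote, so the quote can sit inside a string constant.
quote_demo() { awk 'BEGIN { print "\042Acme\042" }'; }

# \n inserts a newline, so one print statement produces two lines.
newline_demo() { awk 'BEGIN { print "line one\nline two" }'; }

tab_demo
quote_demo
newline_demo
```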

If the fields in the awk print statement are separated by commas (,), then the fields are separated by a space when they are printed. The comma is not a required part of the syntax. In a print statement, the comma actually represents the value of the awk variable OFS (output field separator). The default value of the OFS variable is a single space.

The following example adds a single tab character between Field 4 and Field 2.

$ awk '{ print $3, $4 "\t" $2 }' data.file
Joel Craig       NW
Sharon Kelly     WE
Chris Foster     SW
May Chin         SO
Derek Johnson    SE
Susan Beal       EA
TJ Nichols       NE
Val Shultz       NO
Sheri Watson     CT

Notice that it was not necessary to use the comma before or after the tab (\t) in the previous print statement. The tab forces the output that follows to begin at the next tab position.

Using Regular Expressions

Regular expression metacharacters can be used in the pattern. In the following example, awk searches for the pattern east, and prints all lines containing that pattern.

$ awk '/east/' data.file
southeast       SE      Derek Johnson   5.0 .70 4       17
eastern         EA      Susan Beal      4.4 .8  5       20
northeast       NE      TJ Nichols      5.1 .94 3       13

The following example prints only Fields 1, 5, and 4 from lines containing the pattern east.

$ awk '/east/ { print $1, $5, $4 }' data.file
southeast 5.0 Johnson
eastern 4.4 Beal
northeast 5.1 Nichols

The following example prints Fields 1 and 5, and then a tab before Field 4, for all lines containing the pattern east.

$ awk '/east/ { print $1, $5 "\t" $4 }' data.file
southeast 5.0   Johnson
eastern 4.4     Beal
northeast 5.1   Nichols

The string can contain regular expression characters. In the following example, the pattern east must be at the beginning of the line.

$ awk '/^east/' data.file
eastern         EA      Susan Beal      4.4 .8  5       20

The period (.) matches any single character. In this example, any character followed by a 9 is a match.

$ awk '/.9/' data.file
northwest       NW      Joel Craig      3.0 .98 3       4
western         WE      Sharon Kelly    5.3 .97 5       23
southern        SO      May Chin        5.1 .95 4       15
northeast       NE      TJ Nichols      5.1 .94 3       13
north           NO      Val Shultz      4.5 .89 5       9
central         CT      Sheri Watson    5.7 .94 5       13

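The line beginning with north matches only because it contains a 9 preceded by some other character (the 8 in .89, and the blank before the final 9). It is a line of output you might not want if you are trying to identify lines that contain a literal .9. To take away the special meaning of a regular expression character, precede it with a backslash (\).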

$ awk '/\.9/' data.file
northwest       NW      Joel Craig      3.0 .98 3       4
western         WE      Sharon Kelly    5.3 .97 5       23
southern        SO      May Chin        5.1 .95 4       15
northeast       NE      TJ Nichols      5.1 .94 3       13
central         CT      Sheri Watson    5.7 .94 5       13
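
The effect of the backslash is easier to see on a very small input. The sketch below uses two hypothetical records (not taken from data.file), only one of which contains a literal .9; the function names are likewise illustrative.

```shell
# Two hypothetical records: only the first contains a literal ".9".
sample() { printf '%s\n' 'alpha .95' 'beta 89'; }

# Unescaped dot: any character followed by 9, so both records match.
any_char() { sample | awk '/.9/ { print $1 }'; }

# Escaped dot: only a literal period followed by 9 matches.
literal_dot() { sample | awk '/\.9/ { print $1 }'; }

any_char
literal_dot
```

The first command prints alpha and beta; the second prints only alpha.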

The BEGIN and END Special Patterns

There are two special patterns that are not used to match text in a file. The BEGIN pattern (all letters must be uppercase) indicates an action that occurs before any of the input lines are read. It is commonly used for printing a heading or title line for the report before any data is processed and to assign values to built-in variables.

The END pattern (all uppercase) indicates an action that occurs after all input records have been read and fully processed. It is commonly used to print summary statements or numeric totals.

BEGIN and END statements can each occur multiple times in any awk program (and in any order). If there are multiple occurrences of either pattern, they are executed in the order in which they are found in the file.

The following example has a BEGIN statement to add a header to the output.

$ awk 'BEGIN { print "Eastern Regions\n" };  /east/ { print $5, $4 }' data.file
Eastern Regions

5.0 Johnson
4.4 Beal
5.1 Nichols

Although you can use multiple lines for the awk command, the beginning brace of the action for the BEGIN and END patterns must appear on the same line as the keyword BEGIN or END.

The correct use is:

$ awk 'BEGIN {
> print "Eastern Regions\n"};  /east/ {print $5, $4}' data.file
Eastern Regions

5.0 Johnson
4.4 Beal
5.1 Nichols

An example of incorrect use, and the resulting error (the exact message varies between awk implementations), is:

$ awk 'BEGIN
> { print "Eastern Regions\n" };  /east/ { print $5, $4 }' data.file
awk: syntax error at source line 2
context is
       BEGIN >>>
<<
awk: bailing out at source line 2

The END pattern allows the action to occur at the end of the input file.

$ awk 'BEGIN { print "Eastern Regions\n"}; /east/ {print $5, $4}
> END {print "Eastern Region Monthly Report"}' data.file
Eastern Regions

5.0 Johnson
4.4 Beal
5.1 Nichols
Eastern Region Monthly Report
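
Because the END action runs after the last record, it is also a natural place to print a numeric total. The sketch below goes slightly beyond the basics covered in this module: it assumes a user-defined variable, here called total (a hypothetical name), that accumulates Field 5 for each matching record. The three sample records are the east lines from data.file, trimmed to five fields.

```shell
# The three "east" records from data.file, trimmed to the first five fields.
sample() {
    printf '%s\n' \
        'southeast SE Derek Johnson 5.0' \
        'eastern EA Susan Beal 4.4' \
        'northeast NE TJ Nichols 5.1'
}

# Add Field 5 to a running total for each match, then print the total once at the end.
total_demo() {
    sample | awk '
        /east/ { total = total + $5 }
        END    { print "Total:", total }'
}

total_demo
```

An awk variable that has never been assigned behaves as zero in arithmetic, so total does not need to be initialized in a BEGIN action.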

Using awk Scripts

An awk script is a collection of awk statements (patterns and actions) stored in a text file. An awk script reduces the chance for errors because the commands are stored in a file and are read from the file each time they are needed. Give the script file a descriptive name. To instruct awk to read the script file, use the command:

awk -f script_file data_file

For example:

$ cat report.awk
BEGIN {print "Eastern Regions\n"}
/east/ {print $5, $4}
END {print "Eastern Region Monthly Report"}
$ awk -f report.awk data.file
Eastern Regions

5.0 Johnson
4.4 Beal
5.1 Nichols
Eastern Region Monthly Report

Using an awk script makes it easy to make changes or additions. In the following example, a second BEGIN statement is added to print an overall heading for the report. Remember, the BEGIN statements are executed in order.

$ cat report2.awk
BEGIN {print "** Acme Enterprises **"}
BEGIN {print "Eastern Regions\n"}
/east/ {print $5, $4}
END {print "Eastern Region Monthly Report"}
$ awk -f report2.awk data.file
** Acme Enterprises **
Eastern Regions

5.0 Johnson
4.4 Beal
5.1 Nichols
Eastern Region Monthly Report

Using Built-in Variables

As awk processes the input file, it uses several built-in variables. You can assign values to some of these variables, such as FS and OFS, while others, such as NR, are set by awk as it reads the input. The table below lists some of the built-in variables.

Name   Default value   Description
FS     Space or tab    The input field separator
OFS    Space           The output field separator
NR     N/A             The number of records from the beginning of the first input file

Working With Variables

A variable value can be a number, a string, or a set of values in an array. The awk programming language uses several built-in variables. To assign a value to a variable, use the format:

variablename = value
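
As a brief illustration of the assignment format, the sketch below sets one string variable and one numeric variable in a BEGIN action and prints them. The variable names title and count, and the function name, are hypothetical.

```shell
# Assign a string and a number to user-defined variables, then print both.
assign_demo() {
    awk 'BEGIN {
        title = "Eastern Regions"   # string value
        count = 3                   # numeric value
        print title, count
    }'
}

assign_demo
```

The command prints Eastern Regions 3; no input file is needed because all of the work happens in the BEGIN action.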

Input Field Separator

The default input field separator (FS) is white space, which can be either a space or a tab. Frequently, other characters can separate the input, such as a colon or comma. You can set the input field separator variable with the -F option or set the value with an assignment. The following two examples both set the input field separator to a colon:

awk -F: 'statement' filename
awk 'BEGIN { FS=":" } ; statement' filename
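
The two forms can be checked against each other on a single line of input. This is a minimal sketch: the sample line imitates the /etc/group layout but is made up, and the variable and function names are illustrative only.

```shell
# Hypothetical line in /etc/group format: name, password, group ID, members.
group_line='staff::10:alice,bob'

# Separator set with the -F option.
with_option() { printf '%s\n' "$group_line" | awk -F: '{ print $1, $3 }'; }

# Separator set by assigning FS in a BEGIN action.
with_begin() { printf '%s\n' "$group_line" | awk 'BEGIN { FS=":" } ; { print $1, $3 }'; }

with_option
with_begin
```

Both commands print staff 10. The empty password field between the two colons still counts as Field 2.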

When using the -F option or the FS variable, you can specify more than one field separator either by placing the value in square brackets (creating a character class), or by separating the values with a | (OR statement) within double quotes.

awk -F"[ :]" 'statement' filename
awk -F" |:" 'statement' filename
awk 'BEGIN { FS="[ :]" }  next_statement' filename
awk 'BEGIN { FS=" |:" }  next_statement' filename
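
The sketch below applies both notations to one hypothetical line that mixes a colon and a space as separators. The sample value and the function names are assumptions made for the example.

```shell
# Hypothetical line that uses both a colon and a space between values.
line='10:30 45'

# Character class: either a space or a colon separates fields.
char_class() { printf '%s\n' "$line" | awk -F'[ :]' '{ print $1, $2, $3 }'; }

# Alternation: a space OR a colon separates fields.
alternation() { printf '%s\n' "$line" | awk -F' |:' '{ print $1, $2, $3 }'; }

char_class
alternation
```

Both commands split the line into three fields and print 10 30 45.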

For example, change the input field separator to process the /etc/group file, whose fields are separated by colons. Because the FS variable is set in a BEGIN action, the new value is in place before the first record of the file is read.

$ awk 'BEGIN { FS=":" }; { print $1, $3 }' /etc/group
root 0
other 1
bin 2
sys 3
adm 4
uucp 5
mail 6
tty 7
lp 8
nuucp 9
staff 10
daemon 12
sysadmin 14
nobody 60001
noaccess 60002
nogroup 65534

You could save the previous command in an awk script:

$ cat report3.awk
BEGIN { FS=":" }
{ print $1, $3 }
$ awk -f report3.awk /etc/group
root 0
other 1
bin 2
sys 3
adm 4
uucp 5
mail 6
tty 7
lp 8
nuucp 9
staff 10
daemon 12
sysadmin 14
nobody 60001
noaccess 60002
nogroup 65534

Output Field Separator

The default output field separator is a space. In the print statement, a comma specifies using the output field separator. If you omit the comma, the fields run together. You can also specify a field separator directly in the print statement. Compare the following three lines:

$ awk '{ print $3 $4 $2 }' data.file
$ awk '{ print $3, $4, $2 }' data.file
$ awk '{ print $3, $4 "\t" $2 }' data.file
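
To make the comparison concrete, the sketch below runs the same three print statements on one record. The record is the first line of data.file trimmed to four fields; the variable and function names are illustrative.

```shell
# First record of data.file, trimmed to four fields.
record='northwest NW Joel Craig'

# No commas: the fields are concatenated with nothing between them.
no_commas() { printf '%s\n' "$record" | awk '{ print $3 $4 $2 }'; }

# Commas: each comma becomes the output field separator (a space by default).
with_commas() { printf '%s\n' "$record" | awk '{ print $3, $4, $2 }'; }

# Comma plus an explicit tab string between Field 4 and Field 2.
with_tab() { printf '%s\n' "$record" | awk '{ print $3, $4 "\t" $2 }'; }

no_commas
with_commas
with_tab
```

The first command prints JoelCraigNW, the second prints Joel Craig NW, and the third prints Joel Craig followed by a tab and NW.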

To set the output field separator, place the assignment within a BEGIN statement:

$ awk 'BEGIN { OFS="\t" } ; { print $3, $4, $2 }' data.file
Joel    Craig   NW
Sharon  Kelly   WE
Chris   Foster  SW
May     Chin    SO
Derek   Johnson SE
Susan   Beal    EA
TJ      Nichols NE
Val     Shultz  NO
Sheri   Watson  CT
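
A visible separator such as a comma makes it easier to see where OFS is applied. In the sketch below (the sample record and function names are assumptions for illustration), OFS appears only where the print statement has a comma; fields that are simply placed next to each other are still concatenated.

```shell
# First record of data.file, trimmed to four fields.
record='northwest NW Joel Craig'

# Every comma in the print statement is replaced by the value of OFS.
ofs_commas() { printf '%s\n' "$record" | awk 'BEGIN { OFS="," } ; { print $3, $4, $2 }'; }

# No comma between $4 and $2, so OFS is not inserted there.
ofs_mixed() { printf '%s\n' "$record" | awk 'BEGIN { OFS="," } ; { print $3, $4 $2 }'; }

ofs_commas
ofs_mixed
```

The first command prints Joel,Craig,NW and the second prints Joel,CraigNW.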

Number of Records

The number of records (NR) variable counts the number of input lines read from the beginning of the first input file. The variable’s value updates each time another input line is read. Inside the BEGIN pattern the value of NR is zero. Inside the END pattern the value of NR is the number of the last record processed.

$ more report4.awk
{ print $3, $4, $2 }
END { print "The number of employee records is " NR }
$ awk -f report4.awk data.file
Joel Craig NW
Sharon Kelly WE
Chris Foster SW
May Chin SO
Derek Johnson SE
Susan Beal EA
TJ Nichols NE
Val Shultz NO
Sheri Watson CT
The number of employee records is 9
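
The way NR changes during a run can be observed by printing it in all three places. This is a small sketch with a hypothetical two-line input; the function name is illustrative.

```shell
# Print NR before any input, alongside each record, and after the last record.
nr_demo() {
    printf '%s\n' 'first' 'second' | awk '
        BEGIN { print "before input:", NR }
              { print NR, $0 }
        END   { print "after input:", NR }'
}

nr_demo
```

The output starts with before input: 0, numbers the two records 1 and 2, and ends with after input: 2.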