Extracting Text with cut
The cut command extracts columns of text from a text file or stream. Imagine taking a sheet of paper that lists rows of names, email addresses, and phone numbers. Rip the page vertically twice so that each column is on a separate piece. Hold onto the middle piece which contains email addresses, and throw the other two away. This is the mentality behind the cut command.
The cut command interprets any command line arguments as the names of files on which to operate, or operates on the standard input stream if none are provided. To specify which bytes, characters, or fields are to be cut, the cut command must be called with one of the following command line switches.
|-b list||Extract bytes specified in list|
|-c list||Extract characters specified in list|
|-f list||Extract fields specified in list|
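As a minimal sketch of the two input modes described above, the following runs the same extraction once with a filename argument and once on standard input. The temporary file and its two sample lines are made up for illustration.

```shell
# Write two sample lines to a temporary file (hypothetical data).
tmp=$(mktemp)
printf 'alpha\nbravo\n' > "$tmp"

# Same extraction, once with a filename argument and once from standard input.
from_file=$(cut -c1-3 "$tmp")
from_stdin=$(cut -c1-3 < "$tmp")

printf '%s\n' "$from_file"    # alp, bra
printf '%s\n' "$from_stdin"   # alp, bra
rm -f "$tmp"
```

Both invocations should print the same two lines, since cut treats a named file and standard input alike.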
The list arguments are actually a comma-separated list of ranges. Each range can take one of the following forms.
|N||Only item number N.|
|N-||Items N through the end of the line.|
|N-M||Items N through M (inclusive).|
|-M||From the beginning of the line through item number M (inclusive).|
|-||All items from the beginning of the line through the end of the line. (Recent GNU versions of cut reject a bare - as an invalid range; 1- selects the same items.)|
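The range forms can be tried on a short string. This sketch uses -c on a made-up eight-letter line so that each item is a single character; the variable names are only for illustration.

```shell
line='abcdefgh'   # hypothetical sample line; position 1 is "a"

only=$(printf '%s\n' "$line" | cut -c3)           # N    -> c
to_end=$(printf '%s\n' "$line" | cut -c3-)        # N-   -> cdefgh
span=$(printf '%s\n' "$line" | cut -c3-5)         # N-M  -> cde
from_start=$(printf '%s\n' "$line" | cut -c-3)    # -M   -> abc
mixed=$(printf '%s\n' "$line" | cut -c1,3-4,7-)   # comma-separated list -> acdgh

printf '%s\n' "$only" "$to_end" "$span" "$from_start" "$mixed"
```

The last command combines three ranges in one list, which is how the lists in the examples that follow are built.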
Extracting text by Character Position with cut -c
With the -c command line switch, the list specifies character positions in a line of text, where the first character is character number 1. As an example, the file /proc/interrupts lists device drivers, the interrupt request (IRQ) line to which they attach, and the number of interrupts which have occurred on that IRQ line. (Do not be concerned if you are not yet familiar with the concepts of a device driver or IRQ line. Focus instead on how cut is used to manipulate the data.)
# cat /proc/interrupts
           CPU0       CPU1       
  0:        128          0   IO-APIC   0-edge      timer
  1:         10          0   IO-APIC   1-edge      i8042
  4:        979          0   IO-APIC   4-edge      ttyS0
  8:          0          0   IO-APIC   8-edge      rtc0
  9:          0          0   IO-APIC   9-fasteoi   acpi
 12:          0        154   IO-APIC  12-edge      i8042
 24:          0         10   PCI-MSI 65536-edge      nvme0q0
 25:       1141        609   PCI-MSI 81920-edge      ena-mgmnt@pci:0000:00:05.0
 26:       1670       2944   PCI-MSI 81921-edge      eth0-Tx-Rx-0
 27:       1272       3692   PCI-MSI 81922-edge      eth0-Tx-Rx-1
...
Because the characters in the file are formatted into columns, the cut command can extract particular regions of interest. If just the IRQ line and the number of interrupts were of interest, the rest of the file could be cut away, as in the following example. (Note the use of the grep command to first reduce the file to just the lines pertaining to interrupt lines.)
# grep '[[:digit:]]:' /proc/interrupts | cut -c1-15
  0:        128
  1:         10
  4:        979
  8:          0
  9:          0
 12:          0
 24:          0
 25:       1141
 26:       1820
 27:       1557
 28:       8956
 29:          0
Alternatively, if only the device drivers bound to particular IRQ lines were of interest, multiple ranges of characters could be specified.
# grep '[[:digit:]]:' /proc/interrupts | cut -c1-5,34-
  0: PIC   0-edge      timer
  1: PIC   1-edge      i8042
  4: PIC   4-edge      ttyS0
  8: PIC   8-edge      rtc0
  9: PIC   9-fasteoi   acpi
 12: PIC  12-edge      i8042
 24: MSI 65536-edge      nvme0q0
 25: MSI 81920-edge      ena-mgmnt@pci:0000:00:05.0
 26: MSI 81921-edge      eth0-Tx-Rx-0
 27: MSI 81922-edge      eth0-Tx-Rx-1
 28: MSI 65537-edge      nvme0q1
 29: MSI 65538-edge      nvme0q2
If the order of the character ranges were reversed, could the cut command be used to rearrange the data?
# grep '[[:digit:]]:' /proc/interrupts | cut -c34-,1-5
  0: PIC   0-edge      timer
  1: PIC   1-edge      i8042
  4: PIC   4-edge      ttyS0
  8: PIC   8-edge      rtc0
  9: PIC   9-fasteoi   acpi
 12: PIC  12-edge      i8042
 24: MSI 65536-edge      nvme0q0
 25: MSI 81920-edge      ena-mgmnt@pci:0000:00:05.0
 26: MSI 81921-edge      eth0-Tx-Rx-0
 27: MSI 81922-edge      eth0-Tx-Rx-1
 28: MSI 65537-edge      nvme0q1
 29: MSI 65538-edge      nvme0q2
The answer is no. Text will appear only once, in the same order it appears in the source, even if the range specifications are overlapping or rearranged.
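The same behavior can be reproduced without /proc/interrupts. This small sketch uses a made-up six-letter line to show that overlapping ranges print each character once, and that reversed ranges still come out in source order.

```shell
line='abcdef'   # hypothetical sample line

# Overlapping ranges: characters 2 and 3 are selected twice but printed once.
overlap=$(printf '%s\n' "$line" | cut -c1-3,2-5)    # abcde

# Reversed ranges: output still follows the order of the source line.
reversed=$(printf '%s\n' "$line" | cut -c4-6,1-2)   # abdef

printf '%s\n' "$overlap" "$reversed"
```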
Extracting Fields of Text with cut -f
The cut command can also be used to extract text that is structured not by character position, but by some delimiter character, such as a TAB or “:”. The following command line switches can be used to further qualify what is meant by a field, or to select source lines more selectively.
|-d DELIM||Use DELIM to separate fields on input, instead of the default TAB character.|
|-s||Do not include lines that do not contain the delimiter character (useful for stripping comments and headers).|
|--output-delimiter=STRING||On output, use the text specified by STRING instead of the input field delimiter.|
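A short sketch of how these three switches interact, using made-up colon-separated records and one line with no delimiter. Note that --output-delimiter is a GNU extension, so this assumes GNU cut.

```shell
# Hypothetical input: two ":"-separated records and one line with no ":".
records='root:x:0:0:root:/root:/bin/bash
# a comment line with no delimiter
daemon:x:2:2:daemon:/sbin:/sbin/nologin'

# Without -s, a line lacking the delimiter is passed through whole.
unfiltered=$(printf '%s\n' "$records" | cut -d: -f1)

# With -s the comment line is dropped; fields 1 and 7 are joined by " -> ".
picked=$(printf '%s\n' "$records" | cut -d: -s -f1,7 --output-delimiter=' -> ')

printf '%s\n' "$unfiltered"
printf '%s\n' "$picked"   # root -> /bin/bash, daemon -> /sbin/nologin
```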
For example, the file /usr/share/hwdata/pcitable lists over 3000 vendor IDs and device IDs (which can be probed from PCI devices), and the kernel modules and text strings which should be associated with them, separated by tabs.
$ head -15 pcitable
# This file is automatically generated from isys/pci. Edit
# it by hand to change a driver mapping. Other changes will
# be lost at the next merge - you have been warned.
# Edit by hand to change a driver mapping. Changes to descriptions
# will be lost at the next merge - you have been warned.
# If you run makeids, please make sure no entries are lost.
# The format is ("%d\t%d\t%s\t"%s"\n", vendid, devid, moduleName, cardDescription)
# or ("%d\t%d\t%d\t%d\t%s\t"%s"\n", vendid, devid, subvendid, subdevid, moduleName, cardDesc
0x0675  0x1700  "unknown"       "Dynalink|IS64PH ISDN Adapter"
0x0675  0x1702  "hisax"         "Dynalink|IS64PH ISDN Adapter"
0x09c1  0x0704  "unknown"       "Arris|CM 200E Cable Modem"
0x0e11  0x0001  "ignore"        "Compaq|PCI to EISA Bridge"
0x0e11  0x0002  "ignore"        "Compaq|PCI to ISA Bridge"
0x0e11  0x0046  "cciss"         "Compaq|Smart Array 64xx"
The following example extracts the third and fourth columns, using the default TAB character to separate fields. Note the use of the -s command line switch, which effectively strips the header lines (which do not contain any TABs).
$ cut -s -f3,4 pcitable | head
"unknown"       "Dynalink|IS64PH ISDN Adapter"
"hisax"         "Dynalink|IS64PH ISDN Adapter"
"unknown"       "Arris|CM 200E Cable Modem"
"ignore"        "Compaq|PCI to EISA Bridge"
"ignore"        "Compaq|PCI to ISA Bridge"
"cciss"         "Compaq|Smart Array 64xx"
"unknown"       "Compaq|NC7132 Gigabit Upgrade Module"
"unknown"       "Compaq|NC6136 Gigabit Server Adapter"
"tmspci"        "Compaq|Netelligent 4/16 Token Ring"
"ignore"        "Compaq|Triflex/Pentium Bridge, Model 1000"
As another example, suppose we wanted to obtain a list of the most commonly referenced kernel modules in the file. We could use a similar cut command, along with tricks learned in the last Lesson, to obtain a quick listing of the number of times each kernel module appears.
$ cut -s -f3 pcitable | sort | uniq -c | sort -rn | head
   1988 "unknown"
    148 "ignore"
     83 "aic7xxx"
     70 "gdth"
     37 "e100"
     37 "Card:ATI Rage 128"
     36 "3c59x"
     24 "Card:ATI Mach64"
     21 "tulip"
     20 "agpgart"
Many of the entries are obviously unknown, or intentionally ignored, but we do see that the aic7xxx SCSI driver, and the e100 Ethernet card driver, are commonly used.
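For readers without a pcitable file at hand, the counting pipeline can be tried on a few made-up records in the same TAB-separated layout. The vendor IDs, module names, and descriptions below are invented for illustration.

```shell
# Three hypothetical TAB-separated records; printf reuses the format per record.
sample=$(printf '%s\t%s\t%s\t%s\n' \
    0x0001 0x0001 '"unknown"' '"Vendor A|Card 1"' \
    0x0001 0x0002 '"e100"'    '"Vendor A|Card 2"' \
    0x0002 0x0001 '"unknown"' '"Vendor B|Card 3"')

# Prepend a header with no TABs, which -s discards, then count field 3.
counts=$(printf '# header without tabs\n%s\n' "$sample" |
    cut -s -f3 | sort | uniq -c | sort -rn)

printf '%s\n' "$counts"   # "unknown" is counted twice, "e100" once
```

The first sort groups identical module names so that uniq -c can count them; the second orders the counts from highest to lowest.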
Extracting Text by Byte Position with cut -b
The -b command line switch specifies the text to extract by byte position. Extracting text with -b is similar in spirit to using -c. In fact, for text encoded in ASCII or one of the ISO 8859 character sets (such as Latin-1), where every character occupies exactly one byte, the two are identical. The -b switch differs from -c, however, for variable-length encodings such as UTF-8 (a standard encoding on which many people are converging, and the default in Red Hat Enterprise Linux).
As a quick example, consider the following three characters of German text: für. When using UTF-8 encoding, the two characters which are part of the ASCII character set, “f” and “r”, are each encoded using a single byte. The “ü”, however, is encoded using two bytes, as is evidenced by the wc command.
$ echo für | wc -c
5
Accounting for the five bytes, we have one byte each for the letters “f” and “r” and one byte for the newline which echo appended to the output, leaving two bytes for the “ü”. When using cut -c, the “ü” would be considered a single character, but when using cut -b, “ü” would be considered two bytes, as in the following example.
$ echo fü | cut -c 1-2
fü
$ echo fü | cut -b 1-2
f?
The first time, the cut command counted the two bytes used to encode the “ü” as a single character, but the second time, it was considered two bytes. As a result, the character was “cut in half” by the cut command, and the terminal was not able to display it correctly.
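The byte arithmetic can be checked without relying on the terminal's encoding. This sketch writes “für” as explicit UTF-8 bytes using octal escapes (\303\274 is “ü”) and inspects the result of cut -b with od; the variable names are only for illustration.

```shell
# "für" as explicit UTF-8 bytes: 66 c3 bc 72.
word=$(printf 'f\303\274r')

# Four bytes of text plus the newline.
total=$(printf '%s\n' "$word" | wc -c | tr -d ' ')                           # 5

# Bytes 1-3 cover "f" and both bytes of "ü", so the character stays intact.
whole=$(printf '%s\n' "$word" | cut -b 1-3)

# Bytes 1-2 keep only the first half of "ü"; dump the result in hex.
split_hex=$(printf '%s\n' "$word" | cut -b 1-2 | od -An -tx1 | tr -d ' \n')  # 66c30a

printf '%s\n' "$total" "$whole" "$split_hex"
```

The hex dump shows the lone c3 byte that the terminal could not display in the example above.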
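Usually, cut -c is the proper way to use the cut command, and cut -b is necessary only when data is laid out by byte offset rather than by character, as in fixed-width binary records. Note that implementations differ: upstream GNU cut has historically treated -c as a synonym for -b, so the multibyte-aware behavior shown above depends on the build that a distribution ships.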