The uniq program identifies, counts, or removes duplicate records in sorted data. If given a filename argument, uniq reads its input from that file; an optional second argument names a file to receive the output. If no arguments are provided, the uniq command operates on standard input. Because the uniq command only compares adjacent lines, duplicates must already sit next to each other, so it is almost always used in conjunction with the sort command.
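The dependence on sorted input can be seen with a small sketch. The fruit names below are made-up sample data, not part of the dice example.

```shell
# Unsorted input: the two "apple" lines are not adjacent, so uniq keeps both.
printf 'apple\nbanana\napple\n' | uniq

# Sorted first: the duplicates become neighbors and collapse into one line.
printf 'apple\nbanana\napple\n' | sort | uniq
```

The first pipeline should print three lines and the second only two.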
The uniq command uses the following command line switches to qualify its behavior.
|-c, --count||Prefix each line with the number of its occurrences; this is the length of the “run”.|
|-d, --repeated||Print only duplicated lines.|
|-f, --skip-fields=n||Avoid comparing the first n fields; fields are delimited by whitespace.|
|-i, --ignore-case||Ignore case.|
|-s, --skip-chars=n||Skip the first n characters.|
|-u, --unique||Print only unique lines.|
|-w, --check-chars=n||Compare no more than n characters in each line.|
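The comparison switches are easier to follow with a few short runs. This is a minimal sketch that assumes the GNU version of uniq (the -w switch in particular is not available everywhere); all of the input lines are hypothetical sample data.

```shell
# -i: "Apple", "apple" and "APPLE" compare equal; the first line of the run is kept.
printf 'Apple\napple\nAPPLE\n' | uniq -i

# -f 1: skip the first whitespace-delimited field (here a made-up timestamp).
printf '09:00 login\n09:05 login\n09:10 logout\n' | uniq -f 1

# -s 3: skip the first three characters of each line before comparing.
printf 'id1 red\nid2 red\nid3 blue\n' | uniq -s 3

# -w 3: compare only the first three characters of each line.
printf 'abc-one\nabc-two\nabd-one\n' | uniq -w 3
```

In each case uniq prints the first line of every run that it considers identical, so the skipped or ignored portion of the later lines is lost from the output.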
In order to understand the uniq command’s behavior, we need repetitive data on which to operate. The following Python script simulates the rolling of three six-sided dice, writing the sum of each of 100 rolls on its own line. The user madonna makes the script executable and then records the output in a file called trial1.
$ cat three_dice.py
#!/usr/bin/python
from random import randint
# Roll three six-sided dice 100 times, printing each sum on its own line.
for i in range(100):
    print(randint(1,6)+randint(1,6)+randint(1,6))
$ chmod 755 three_dice.py
$ ./three_dice.py > trial1
$ wc trial1
100 100 260 trial1
$ head trial1
10
10
10
13
8
8
10
10
8
6
Reducing Data to Unique Entries
Now, madonna would like to analyze the data. She begins by sorting the data and piping the output through the uniq command.
$ sort -n trial1 | uniq
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Without any command line switches, the uniq command has removed duplicate entries, reducing the data from 100 lines to only 15. At a glance, madonna sees that the data looks reasonable: every possible sum of three six-sided dice is represented, with the exception of 3. Because only one combination of the dice yields a sum of 3 (all ones), she expects it to be a relatively rare occurrence.
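How rare a sum of 3 should be can be estimated with a little arithmetic. The sketch below assumes fair dice and independent rolls, and uses awk only as a calculator.

```shell
# Probability that one roll of three fair dice sums to 3 (all ones),
# and the chance that this sum never appears in 100 independent rolls.
awk 'BEGIN {
    p = 1 / (6 * 6 * 6)    # 1 of 216 equally likely outcomes
    printf "P(sum = 3 on one roll)  = %.4f\n", p
    printf "P(no 3 in 100 rolls)    = %.3f\n", (1 - p) ^ 100
}'
```

Under those assumptions, a trial of 100 rolls contains no 3 at all roughly 63% of the time, so its absence from trial1 is unremarkable.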
Counting Instances of Data
A particularly convenient command line switch for the uniq command is -c, or --count. This causes the uniq command to count the number of occurrences of a particular record, prepending the result to the record on output.
In the following example, madonna uses the uniq command to reproduce its previous output, this time prepending the number of occurrences of each entry in the file.
$ sort -n trial1 | uniq -c
      1 4
      4 5
      6 6
     10 7
     10 8
     13 9
     13 10
      9 11
     13 12
      4 13
      8 14
      4 15
      1 16
      2 17
      2 18
As would be expected (by a statistician, at least), the largest and smallest numbers have relatively few occurrences, while the intermediate numbers occur more often. The first column can be summed to 100 to confirm that the uniq command identified every occurrence.
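Summing the count column by hand is tedious, so the check can be handed to awk. The following is a minimal sketch on a small, made-up set of six rolls held in a shell variable named rolls; with the real trial1 file the same pipeline would be expected to report 100.

```shell
# Six hypothetical rolls standing in for a trial file.
rolls=$(printf '10\n8\n10\n13\n8\n10\n')

# Sum the count column produced by uniq -c; the total should equal
# the number of input lines (6 here).
printf '%s\n' "$rolls" | sort -n | uniq -c | awk '{ total += $1 } END { print total }'

# List the values from most to least frequent.
printf '%s\n' "$rolls" | sort -n | uniq -c | sort -rn
```

Piping the counted output through sort -rn is a common follow-up, because it puts the most frequent values at the top.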
Identifying Unique or Repeated Data with uniq
Sometimes the only question is which records are unique and which are repeated. The -u and -d command line switches allow the uniq command to answer it. In the first case, madonna identifies the sums that occur only once. In the second case, she identifies the sums that occur more than once.
$ sort -n trial1 | uniq -u
4
16
$ sort -n trial1 | uniq -d
5
6
7
8
9
10
11
12
13
14
15
17
18
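The two switches split the distinct values into two groups, which a small sketch makes concrete. The six values below are hypothetical and already sorted, so sort is omitted.

```shell
# Six hypothetical, already sorted values.
rolls=$(printf '4\n5\n5\n6\n6\n6\n')

# -u: values that appear exactly once.
printf '%s\n' "$rolls" | uniq -u

# -d: one copy of each value that appears more than once.
printf '%s\n' "$rolls" | uniq -d

# -d together with -c: the repeated values along with their counts.
printf '%s\n' "$rolls" | uniq -d -c
```

Every distinct value lands in exactly one of the first two outputs, which matches the trial1 results above: 2 unique sums plus 13 repeated sums gives the 15 distinct sums seen earlier.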