The Unix tool comm compares two sorted list files line by line. It can show the lines common to both files, or the lines that appear in only one of them.
Unix has three basic tools for comparing files: diff, comm and cmp.
The first shows the differences between two text files, indicating
which lines have been changed, deleted or added. The command comm
shows which lines are unique to the first or second file in the
comparison, or which lines are common to both text files. Unlike
diff, comm requires the input files to be sorted. Both diff and comm
are line-oriented and are therefore used for comparing text files.
The third tool, cmp, shows differences at byte level, and is
therefore the appropriate tool for comparing binary files, although
it can also be useful for comparing text files at byte level. In the
rest of this text, I will ignore cmp and diff, giving examples of how
comm can be used to compare lexicons.
Suppose we have two word lists, one containing nouns, the other
containing verbs. As you can see from the following two lists, some
words occur in both files, because they can be both nouns and verbs.
The tool comm can select all lines (i.e. words in our example) which
are common to both files, as well as the lines which are unique to
the first or unique to the second file. Before illustrating the use
of comm, it is important to stress that the input files have to be
sorted.
$ cat nouns.txt
answer
choice
drink
explanation
food
record
search
$ cat verbs.txt
answer
choose
drink
eat
explain
record
search
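Since comm depends on sorted input, it can be worth checking the order of a list before comparing it. The following is a minimal sketch, assuming a POSIX sort and an available mktemp; the file names and sample words are hypothetical, and setting LC_ALL=C is an assumption made so that sort and comm agree on the ordering.

```shell
#!/bin/sh
# Sketch: check whether a word list is sorted, and sort it if needed.
LC_ALL=C; export LC_ALL   # assumption: plain byte-order collation

workdir=$(mktemp -d)
# Hypothetical unsorted word list, used only for this illustration.
printf '%s\n' search answer drink choice > "$workdir/unsorted.txt"

# sort -c only checks the order: it exits non-zero if the file is unsorted.
if sort -c "$workdir/unsorted.txt" 2>/dev/null; then
    status=sorted
else
    status=unsorted
fi

# sort -u sorts the list and also drops duplicate lines.
sort -u "$workdir/unsorted.txt" > "$workdir/sorted.txt"
sorted_words=$(cat "$workdir/sorted.txt")

printf '%s\n' "$status"
printf '%s\n' "$sorted_words"
rm -rf "$workdir"
```

Dropping duplicates with -u is a choice that suits word lists; leave it out if repeated lines are meaningful in your data.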
If we run comm without any option flag, we get three columns of
output. The first column contains the lines which occur only in the
first file. The second column lists the unique lines of the second
file, and the third column lists the lines common to both files.
$ comm nouns.txt verbs.txt
		answer
choice
	choose
		drink
	eat
	explain
explanation
food
		record
		search
If we compare the files nouns.txt and verbs.txt, comm starts by
reading the first line of nouns.txt and comparing it with the first
line of verbs.txt. Since both files contain the line answer, comm
places the line in the third column of the output. The first two
columns remain empty. The first difference occurs on the second line,
where we find choice in nouns.txt but choose in verbs.txt. Since
choice sorts before choose, it cannot occur later in the sorted
second file; it is unique to the first file and is placed in the
first column. The word choose is only found in the second file, and
is therefore placed in the second column. This type of comparison
continues for the rest of the two files, as shown in the previous
output.
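The line-by-line comparison described above can be sketched as a small shell function. This only illustrates the idea and is not how comm is actually implemented; the name merge_compare, the temporary files and the LC_ALL=C setting are assumptions made for the example, and it expects every line to end with a newline.

```shell
#!/bin/sh
# Illustrative sketch only; merge_compare is a hypothetical name.
LC_ALL=C; export LC_ALL   # assumption: plain byte-order collation

merge_compare() {
    exec 3<"$1" 4<"$2"    # open both sorted files on descriptors 3 and 4
    more_a=1; more_b=1
    IFS= read -r a <&3 || more_a=0
    IFS= read -r b <&4 || more_b=0
    while [ "$more_a" -eq 1 ] || [ "$more_b" -eq 1 ]; do
        if [ "$more_a" -eq 1 ] && [ "$more_b" -eq 1 ] && [ "$a" = "$b" ]; then
            # Same line in both files: third column, advance both files.
            printf '\t\t%s\n' "$a"
            IFS= read -r a <&3 || more_a=0
            IFS= read -r b <&4 || more_b=0
        elif [ "$more_a" -eq 1 ] &&
             { [ "$more_b" -eq 0 ] || expr "x$a" \< "x$b" >/dev/null; }; then
            # Line sorts first (or file 2 is exhausted): only in file 1.
            printf '%s\n' "$a"
            IFS= read -r a <&3 || more_a=0
        else
            # Otherwise the current line of file 2 is only in file 2.
            printf '\t%s\n' "$b"
            IFS= read -r b <&4 || more_b=0
        fi
    done
    exec 3<&- 4<&-        # close both descriptors
}

workdir=$(mktemp -d)
printf '%s\n' answer choice drink explanation food record search > "$workdir/nouns.txt"
printf '%s\n' answer choose drink eat explain record search > "$workdir/verbs.txt"

merged=$(merge_compare "$workdir/nouns.txt" "$workdir/verbs.txt")
printf '%s\n' "$merged"
rm -rf "$workdir"
```

Each file is read only once and never rewound, which is why the comparison works only when both inputs are sorted in the same order.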
The output becomes more interesting if we want to select, for example,
only the unique lines of the first file, or only the lines which are
common to both files. This is simply done by using any combination of
the flags -1, -2 and -3, each of which suppresses the column with
that number. For example, if you only want to see the unique lines of
the first file, you only want to show column 1 of the default comm
output. In other words, you do not want to see columns two and three.
As a result, you get all words which can only be used as nouns, not
as verbs:
$ comm -23 nouns.txt verbs.txt
choice
explanation
food
Similarly, if you want to see only the lines unique to the second file
(these lines occur in the second column of the default comm output),
you use the flag -13, which suppresses the first and third columns.
Finally, if you want to know which words can be either noun or verb
(i.e. the words common to both files), you use the flag which
eliminates the first and second columns from the output: -12. Both
selections are illustrated in the following two commands.
$ comm -13 nouns.txt verbs.txt
choose
eat
explain
$ comm -12 nouns.txt verbs.txt
answer
drink
record
search
Like join, another basic Unix text utility, comm manipulates sets of
lines, which enables simple database-style selections.
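Read as set operations, the selections used in this text correspond to intersection and difference, and suppressing only the third column gives the symmetric difference. The following is a minimal sketch under stated assumptions: the variable names and temporary files are hypothetical, LC_ALL=C is set so the sample lists count as sorted, and tr is used to strip the tab that indents the second column.

```shell
#!/bin/sh
# Sketch: comm as a set-operation tool on the two sample word lists.
LC_ALL=C; export LC_ALL   # assumption: plain byte-order collation

workdir=$(mktemp -d)
printf '%s\n' answer choice drink explanation food record search > "$workdir/nouns.txt"
printf '%s\n' answer choose drink eat explain record search > "$workdir/verbs.txt"

# Intersection: words that are both noun and verb.
both=$(comm -12 "$workdir/nouns.txt" "$workdir/verbs.txt")

# Difference: words that are nouns but not verbs.
nouns_only=$(comm -23 "$workdir/nouns.txt" "$workdir/verbs.txt")

# Symmetric difference: words in exactly one list. Column 3 is
# suppressed and the tab that indents column 2 is removed.
either_only=$(comm -3 "$workdir/nouns.txt" "$workdir/verbs.txt" | tr -d '\t')

printf 'both:\n%s\n' "$both"
printf 'nouns only:\n%s\n' "$nouns_only"
printf 'one list only:\n%s\n' "$either_only"
rm -rf "$workdir"
```

Stripping the tabs is only safe because the words themselves contain no tab characters.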