Alfarebis

Musings on informatics, language, music...

Comparing lists: comm

The Unix tool comm compares two sorted, line-based files such as word lists. It can show the lines common to both files, or the lines that appear in only one of the two.


Unix has three basic tools for comparing files: diff, comm and cmp. The first, diff, shows the differences between two text files, indicating which lines have been changed, deleted or added. The command comm shows which lines are unique to the first or second file in the comparison, and which lines are common to both. Unlike diff, comm requires its input files to be sorted. Both diff and comm are line-oriented and are therefore used for comparing text files. The third tool, cmp, shows differences at byte level and is therefore the appropriate tool for comparing binary files, although it can also be useful for comparing text files byte by byte. In the rest of this text I will ignore cmp and diff, and give examples of how comm can be used to compare lexicons.

Suppose we have two word lists, one containing nouns, the other containing verbs. As you can see from the following two lists, some words occur in both files, because they can be both nouns and verbs. The tool comm can select the lines (words, in our example) that are common to both files, as well as the lines that are unique to the first or to the second file. Before illustrating the use of comm, it is important to stress that the input files have to be sorted.

 $ cat nouns.txt
 answer
 choice
 drink
 explanation
 food
 record
 search

 $ cat verbs.txt
 answer
 choose
 drink
 eat
 explain
 record
 search
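
Since comm relies on sorted input, a list of unknown origin is best checked or sorted first. The following is a minimal sketch of that step. The file raw_verbs.txt is a hypothetical, unsorted variant of verbs.txt invented for this illustration, and setting LC_ALL=C is an assumption made here so that sort and comm agree on one collation order (plain byte order).

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumption: byte-order collation for every tool in this session.
LC_ALL=C
export LC_ALL

# raw_verbs.txt is a hypothetical, unsorted version of verbs.txt.
printf '%s\n' search eat answer record choose drink explain > raw_verbs.txt

# sort -c only checks: it exits non-zero if the file is not sorted.
if sort -c raw_verbs.txt 2>/dev/null; then
    before=sorted
else
    before=unsorted
fi

# sort -o writes the sorted result back to the named file.
sort -o raw_verbs.txt raw_verbs.txt

if sort -c raw_verbs.txt 2>/dev/null; then
    after=sorted
else
    after=unsorted
fi

echo "before: $before, after: $after"
cat raw_verbs.txt
```

After this step the file has the same content and order as verbs.txt above and can be passed to comm.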

If we run comm without any option flag, we get three columns of output. The first column contains the lines which occur only in the first file. The second column, indented by one tab, lists the lines unique to the second file, and the third column, indented by two tabs, lists the lines common to both files.

 $ comm nouns.txt verbs.txt
                     answer
 choice
           choose
                     drink
           eat
           explain
 explanation
 food
                     record
                     search

When comparing nouns.txt and verbs.txt, comm reads the first line of each file and compares the two. Since both files contain the line answer, comm places it in the third column; the first two columns remain empty. The first difference occurs on the second line, where we find choice in nouns.txt but choose in verbs.txt. Because choice sorts before choose and the input is sorted, choice cannot occur further down in verbs.txt, so it is unique to the first file and is placed in the first column. Next, comm compares the following line of nouns.txt, drink, with choose. Since choose sorts before drink, it is found only in the second file and is placed in the second column. The two drink lines then match and go to the third column. This type of comparison continues for the rest of the two files, as shown in the previous output.
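
The line-by-line comparison described above can be sketched as a small shell function. This is only an illustration of the idea, not comm's actual implementation: merge_walk is a hypothetical helper name, and the sketch assumes that both files are sorted in byte order (hence LC_ALL=C) and that every line ends with a newline.

```shell
# Scratch directory holding the two sorted word lists from the text.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
LC_ALL=C
export LC_ALL
printf '%s\n' answer choice drink explanation food record search > nouns.txt
printf '%s\n' answer choose drink eat explain record search > verbs.txt

# merge_walk FILE1 FILE2 (hypothetical helper): walk both sorted files in
# parallel and print three tab-indented columns, in the style of comm.
merge_walk() {
    exec 3<"$1" 4<"$2"
    IFS= read -r a <&3; more_a=$?   # more_a is 0 while FILE1 still has lines
    IFS= read -r b <&4; more_b=$?   # more_b is 0 while FILE2 still has lines
    while [ "$more_a" -eq 0 ] || [ "$more_b" -eq 0 ]; do
        if [ "$more_b" -ne 0 ] || { [ "$more_a" -eq 0 ] && [ "$a" \< "$b" ]; }; then
            # Line occurs only in FILE1: column 1, advance FILE1.
            printf '%s\n' "$a"
            IFS= read -r a <&3; more_a=$?
        elif [ "$more_a" -ne 0 ] || [ "$b" \< "$a" ]; then
            # Line occurs only in FILE2: column 2, advance FILE2.
            printf '\t%s\n' "$b"
            IFS= read -r b <&4; more_b=$?
        else
            # Same line in both files: column 3, advance both.
            printf '\t\t%s\n' "$a"
            IFS= read -r a <&3; more_a=$?
            IFS= read -r b <&4; more_b=$?
        fi
    done
    exec 3<&- 4<&-
}

walk_output=$(merge_walk nouns.txt verbs.txt)
printf '%s\n' "$walk_output"
```

Because each file is read only once, from top to bottom, the walk gives correct results only when both inputs are sorted in the same order.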

The output becomes more interesting if we want to select, for example, only the unique lines of the first file, or only the lines which are common to both files. This is done with any combination of the flags -1, -2 and -3, which name the columns you do not want to show. For example, if you only want to see the unique lines of the first file, you only want column 1 of the default comm output. In other words, you suppress columns two and three with the flag -23. As a result, you get all words which can only be used as nouns, not as verbs:

 $ comm -23 nouns.txt verbs.txt
 choice
 explanation
 food

Similarly, if you want to see only the lines unique to the second file (the second column of the default comm output), you use the flag -13, which hides the first and third columns. Finally, if you want to know which words can be both noun and verb (i.e. the words common to both files), you use the flag that removes the first and second columns from the output: -12. Both selections are illustrated in the following two commands.

 $ comm -13 nouns.txt verbs.txt
 choose
 eat
 explain

 $ comm -12 nouns.txt verbs.txt
 answer
 drink
 record
 search
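
The column flags can also be used singly or combined with other tools. The sketch below, which recreates the two word lists in a scratch directory, uses -3 on its own to list the words that belong to only one of the two classes, and pipes the -12 selection into wc to count the shared words. The variable names single_class and shared_count are illustrative only, and LC_ALL=C is again an assumption to fix the sort order.

```shell
# Scratch directory holding the two sorted word lists from the text.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
LC_ALL=C
export LC_ALL
printf '%s\n' answer choice drink explanation food record search > nouns.txt
printf '%s\n' answer choose drink eat explain record search > verbs.txt

# -3 hides only the common column, leaving columns 1 and 2.
# tr strips the tab that indents column 2, giving one flat list.
single_class=$(comm -3 nouns.txt verbs.txt | tr -d '\t')
printf '%s\n' "$single_class"

# Count the words that occur in both lists.
# tr removes the padding some wc implementations add.
shared_count=$(comm -12 nouns.txt verbs.txt | wc -l | tr -d ' ')
echo "shared words: $shared_count"
```

Stripping the tabs is safe here only because the words themselves contain no tab characters.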

Like join, another basic Unix text utility, comm treats files as sets of lines, which makes simple database-style selections possible.

Tags: linux, comm