Description
Two adjacent lines are called almost identical if they differ only in their last character. In the data file, any given line has at most one line almost identical to it. We want to remove every line that is almost identical to another one, that is, both lines of each such pair.
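As a small illustration of the definition, the comparison can be sketched in plain shell. The function name `almost_identical' is hypothetical and not part of the script; it is only an assumption about how one might test a single pair by hand.

```shell
# Hypothetical helper: succeeds when two lines are equal once the last
# character of each is dropped. ${1%?} is the POSIX expansion that removes
# one trailing character. Note that two fully identical lines also pass.
almost_identical() {
    [ "${1%?}" = "${2%?}" ]
}

if almost_identical '3 some text a' '3 some text b'; then
    echo 'almost identical'
fi

if ! almost_identical '83 non matching line a' '375 some more text a'; then
    echo 'not almost identical'
fi
```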
Raw Input
3 some text a
3 some text b
7 more text a
7 more text b
83 non matching line a
375 some more text a
375 some more text b
478 another non matching line b
Desired Output
83 non matching line a
478 another non matching line b
Script and Comments
Script1
[ 1] $!N
[ 2] /^([^\n]*).\n\1.$/d
[ 3] P
[ 4] D
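The four steps can be tried from the command line as follows. This is a sketch that assumes GNU sed; each `-e' carries one step of the script, and the sample data is fed in through a here-document rather than a file.

```shell
# Run Steps [1] to [4] on the Raw Input shown above (GNU sed assumed).
# -r selects extended regular expressions, so ( ) need no backslashes.
result=$(sed -r -e '$!N' -e '/^([^\n]*).\n\1.$/d' -e 'P' -e 'D' <<'EOF'
3 some text a
3 some text b
7 more text a
7 more text b
83 non matching line a
375 some more text a
375 some more text b
478 another non matching line b
EOF
)

# Prints the two lines listed under Desired Output.
printf '%s\n' "$result"
```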
Comments
  1. The Pattern Space is abbreviated to `PS'.
  2. Use the `-r' option of GNU sed, which enables extended regular expressions; without it, the parentheses in Step [2] have to be escaped.
  3. To find out whether the current line is almost identical to the following one, Step [1] appends the next line to the PS. The `$!' address skips this on the last line, where there is no next line to append.
  4. Now the PS has two lines:
    • If ^([^\n]*).\n\1.$ matches the PS, the two lines are almost identical; command `d' therefore deletes both of them and starts the next cycle with a new input line.
    • Otherwise, no line is almost identical to the first line of the PS, so it is printed by Step [3] and then deleted by Step [4]. Command `D' of Step [4] also makes sed jump back to Step [1] without reading new input, so the next iteration starts on what was the second line of the PS.
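To see what the PS holds after Step [1], the `l' command can be used as a debugging aid. The sketch below assumes GNU sed and is not part of the script: `-n' suppresses automatic printing, `l' prints the PS unambiguously (an embedded newline appears as \n and the end of the PS as $), and `q' quits after the first cycle.

```shell
# PS after Step [1] for an almost identical pair.
pair=$(printf '3 some text a\n3 some text b\n' | sed -n '$!N; l; q')
printf '%s\n' "$pair"     # 3 some text a\n3 some text b$

# For a non-matching pair, P (Step [3]) prints only the first line of the PS.
first=$(printf '83 non matching line a\n375 some more text a\n' | sed -n '$!N; P; q')
printf '%s\n' "$first"    # 83 non matching line a
```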