Description
Given an HTML file with several `A' elements, we want to process only those `A' elements whose start tag is <a href=XX>:
  • Remove the start tag, i.e., <a href=XX>,
  • Remove the associated end tag, </a>,
  • But keep the text between the start and end tags.
Raw Input
text1 <a href=normal>text2
</a> text3
<a href=XX>text4
text5</a> text6 <a href=normal>text7</a> text8 <a href=XX>text9</a> text10
<a href=XX>text11</a> text12 <a href=XX>text13</a> text14 <a href=normal>
text15</a>
Desired Output
text1 <a href=normal>text2
</a> text3
text4
text5 text6 <a href=normal>text7</a> text8 text9 text10
text11 text12 text13 text14 <a href=normal>
text15</a>
Script and Comments
Script1
[ 1] :top
[ 2] /<a href=XX>/!b
[ 3] s/<a href=XX>/\n/
[ 4] /\n.*<\/a>/!{
[ 5] s/\n//
[ 6] :loop
[ 7] n
[ 8] /<\/a>/!b loop
[ 9] }
[10] s/<\/a>([^\n]*)$/\1/
[11] s/\n//
[12] b top
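
As a rough sketch of how Script1 might be run, the block below wraps the script (with the step numbers removed) in a shell function and feeds it the sample from Raw Input. The function name strip_xx_anchors and the variable names are hypothetical and used only for illustration; GNU sed is assumed, since the script relies on `-r' and on `\n' in regular expressions.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration): reads HTML on stdin
# and writes the processed text to stdout. Assumes GNU sed.
strip_xx_anchors() {
    sed -r '
:top
# Lines without the XX start tag are printed unchanged.
/<a href=XX>/!b
# Replace the first XX start tag with a newline marker.
s/<a href=XX>/\n/
# No end tag after the marker: drop the marker and read on
# until a line containing an end tag arrives.
/\n.*<\/a>/!{
s/\n//
:loop
n
/<\/a>/!b loop
}
# Remove the first end tag that has no marker after it.
s/<\/a>([^\n]*)$/\1/
# Remove the marker if it is still present, then rescan the line.
s/\n//
b top
'
}

# Sample taken from the Raw Input section.
input='text1 <a href=normal>text2
</a> text3
<a href=XX>text4
text5</a> text6 <a href=normal>text7</a> text8 <a href=XX>text9</a> text10
<a href=XX>text11</a> text12 <a href=XX>text13</a> text14 <a href=normal>
text15</a>'

result=$(printf '%s\n' "$input" | strip_xx_anchors)
printf '%s\n' "$result"
```

With GNU sed this should reproduce the text under Desired Output. Without `-r', the substitution that uses a group would instead be written with escaped parentheses, s/<\/a>\([^\n]*\)$/\1/.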
Comments
  1. The `-r' option of GNU sed must be used; otherwise the parentheses in Step [10] have to be escaped as \( and \).
  2. A given line may contain several <a href=XX> start tags and </a> end tags.
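  3. A newline is used as a marker to determine whether the end tag associated with a start tag is on the same line:
    • Step [3] replaces the first start tag with a newline, which cannot otherwise occur in the pattern space at that point.
    • Step [4] tests whether an end tag follows that newline, i.e., whether the end tag is on the same line as the start tag.
    • If it is not, Step [5] removes the marker and the loop in Steps [6] through [8] reads lines until one containing an end tag is found; that end tag is the associated one.
    • Step [10] removes the first end tag that has no newline after it, i.e., the associated end tag, and Step [11] removes the marker if it is still present.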
  4. Since the line may contain more <a href=XX>...</a> pairs, Step [12] makes sed jump back to Step [1].
  5. The flow chart is shown as follows: