Description
We want to extract every IMG element from an HTML file, where
  • There may be more than one IMG element on a line.
  • An IMG element may span more than one line.
  • Assume that no literal > appears in attribute values. To handle a literal > enclosed in double quotes, for example <img src=url alt="a > test...">, see p.20090921a.html.
Raw Input
This is the first image <IMG src=url_1 width=200 height=400>, followed
by the second <IMG src=url_2> and third <IMG src=
url_3 width=220 height=220>. There may be several IMGs in one line, like
<IMG src=url_4>, <IMG src=url_5> and <IMG src=
url_6 width=250 height=300
alt="test">.
Desired Output
<IMG src=url_1 width=200 height=400>
<IMG src=url_2>
<IMG src= url_3 width=220 height=220>
<IMG src=url_4>
<IMG src=url_5>
<IMG src= url_6 width=250 height=300 alt="test">
Script and Comments
Script1
[ 1] /<IMG( |$)/!d
[ 2] s/<IMG( |$)/\n&/
[ 3] s/^[^\n]*\n//
[ 4] :loop
[ 5] $!{
[ 6] />/!N
[ 7] />/!b loop
[ 8] }
[ 9] s/\n/ /g
[10] s/>/&\n/
[11] P
[12] D
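Running Script1
The following is a minimal sketch of how Script1 might be run against the raw input above. It assumes GNU sed (for -r and for \n inside patterns and replacements); the file names script1.sed and input.html are made up for illustration. The [n] step labels are not sed syntax, so they appear only in comments.

```shell
# Sketch only: hypothetical file names, GNU sed assumed.
dir=$(mktemp -d)

# Script1 without the [n] labels; comments give the step numbers.
cat > "$dir/script1.sed" <<'EOF'
# [1] discard text that holds no IMG start tag
/<IMG( |$)/!d
# [2]-[3] cut everything in front of the first IMG
s/<IMG( |$)/\n&/
s/^[^\n]*\n//
# [4]-[8] append input lines until PS holds the closing >
:loop
$!{
/>/!N
/>/!b loop
}
# [9]-[10] join the lines, then put a newline after the closing >
s/\n/ /g
s/>/&\n/
# [11]-[12] print the first element, delete it, restart on the rest
P
D
EOF

# The raw input from this page.
cat > "$dir/input.html" <<'EOF'
This is the first image <IMG src=url_1 width=200 height=400>, followed
by the second <IMG src=url_2> and third <IMG src=
url_3 width=220 height=220>. There may be several IMGs in one line, like
<IMG src=url_4>, <IMG src=url_5> and <IMG src=
url_6 width=250 height=300
alt="test">.
EOF

out=$(sed -r -f "$dir/script1.sed" "$dir/input.html")
rm -r "$dir"
printf '%s\n' "$out"
```

With GNU sed this should print the six lines listed under Desired Output.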
Comments (Script1 requires the -r option)
  1. Lines that do not contain (part of) any IMG element are discarded by Step [1].
  2. Given a line that contains one or more IMG elements, the first element is extracted as follows:
    • First, delete the data before it, which is done by
      • inserting a newline character before it, then
      • deleting all data up to and including that newline character.
      This is done via Steps [2] and [3].
    • Then locate the end of that element:
      • If the pattern space (PS) does not yet hold the closing >, append further input lines to PS until it holds the whole element or the input ends. This is done via a loop consisting of Steps [4] thru [8].
      • Embedded newline characters are replaced with spaces by Step [9].
      Once the closing > is found, Step [10] appends a newline character after it.
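    • Step [11] prints the first element, that is, PS up to the first newline character.
    • Step [12] deletes the first element, making the second the first. Then sed jumps to Step [1] with the rest of PS, without reading a new input line. This "find the first, print, then delete" sequence repeats until PS no longer holds any element.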
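To see what the individual steps contribute, the sketch below runs two fragments of Script1 on their own against a made-up one-line input. It again assumes GNU sed; the variable names cut_prefix and split are illustrative only. The first fragment is Steps [2] and [3]; the second is Steps [10] thru [12] without the steps that precede them.

```shell
# Sketch only: GNU sed assumed, sample line is made up.
line='see <IMG src=a> and <IMG src=b>.'

# Steps [2]-[3] alone: mark the first IMG with a newline, then
# delete everything up to and including that newline.
cut_prefix=$(printf '%s\n' "$line" |
  sed -r -e 's/<IMG( |$)/\n&/' -e 's/^[^\n]*\n//')
printf '%s\n' "$cut_prefix"

# Steps [10]-[12] alone, applied to the result: put a newline after
# the first >, print up to it (P), delete up to it and restart (D).
split=$(printf '%s\n' "$cut_prefix" |
  sed -r -e 's/>/&\n/' -e P -e D)
printf '%s\n' "$split"
```

The second fragment prints the leftover text " and <IMG src=b>" and "." as well, because nothing removes the data in front of the next element or discards the remainder. In Script1 that is the job of Steps [1] thru [3], which run again each time D restarts the cycle.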