Description
We want to extract every IMG element from an HTML file, where
  • There may be more than one IMG element on a line.
  • An IMG element may span more than one line.
  • Any angle bracket inside a pair of double quotes is treated literally.
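The third rule can be made concrete with a minimal Python sketch. It is not part of the sed solution, and the function name non_enclosed_indexes is made up here for illustration. It toggles a flag on every double quote and reports a character only while the flag is off.

```python
def non_enclosed_indexes(text, target):
    """Return the indexes of `target` characters lying outside double quotes."""
    inside_quotes = False
    hits = []
    for i, ch in enumerate(text):
        if ch == '"':
            # Every double quote opens or closes a quoted region.
            inside_quotes = not inside_quotes
        elif ch == target and not inside_quotes:
            hits.append(i)
    return hits


line = '<IMG src=url_2 alt="embedded > in quotes" width=200>'
# Prints a single index, that of the final ">"; the one inside
# the quotes is treated literally and skipped.
print(non_enclosed_indexes(line, ">"))
```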
Raw Input
Not this "<IMG src=test alt=fake>" but this <IMG src=url_1
alt="a normal case...">. The second ...
<IMG src=url_2 alt="test for embedded >
in quotes" width=200 height=200>. Not this "<IMG src=another
alt=cheat> but this <IMG src=url_3 title="
another > test" alt="another test">, the fourth is <IMG
src=url_4 width=200 height=250>. Finally, there are the 
fifth <IMG src=url_5 width=250> sixth <IMG src=url_6 width=100
height=200>. Bye!
Desired Output
<IMG src=url_1 alt="a normal case...">
<IMG src=url_2 alt="test for embedded > in quotes" width=200 height=200>
<IMG src=url_3 title=" another > test" alt="another test">
<IMG src=url_4 width=200 height=250>
<IMG src=url_5 width=250>
<IMG src=url_6 width=100 height=200>
Script and Comments
Script1
[ 1] /^([^"<]*"[^"]*")*[^"<]*<IMG( |$)/!d
[ 2] s/^([^"<]*"[^"]*")*[^"<]*//
[ 3] :loop
[ 4] $!{
[ 5] N
[ 6] s/\n/ /g
[ 7] /^([^">]*"[^"]*")*[^">]*>/!b loop
[ 8] }
[ 9] s/^([^">]*"[^"]*")*[^">]*>/&\n/
[10] P
[11] D
Comments (the script uses extended regular expressions, so run sed with -r)
  1. In the following, `not enclosed inside a pair of double quotes' is abbreviated to `non-enclosed'.
  2. Step [1] discards any line in which the first non-enclosed < does not begin an IMG element;
    ^([^"<]*"[^"]*")*[^"<]* matches everything up to, but not including, that first non-enclosed <.
  3. Step [2] then deletes everything from the beginning of the pattern space (PS) up to, but not including, the first IMG element.
  4. To locate the closing bracket of the first IMG element, ([^">]*"[^"]*")*[^">]*> is used.
  5. Steps [3] thru [8] form a loop that keeps appending lines to PS, joined by a space, until the closing bracket of the first IMG element is found or the last input line has been read.
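  6. Step [9] inserts a newline character after the closing bracket of the first IMG element.
  7. Step [10] prints PS up to that newline, i.e. the IMG element; Step [11] deletes the printed part and restarts the script on the rest of PS without reading a new input line, so further IMG elements on the same line are found too.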
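The steps above can also be sketched outside sed. The following Python emulation is only an illustration under stated assumptions: the name extract_imgs and all variable names are made up here, the two patterns are copied from the script with their groups made non-capturing, and the closing-bracket pattern is applied with re.match, so it is anchored at the start of PS.

```python
import re

# Steps [1]-[2]: group 1 is everything before the first non-enclosed "<",
# which must then begin an IMG element.
IMG_START = re.compile(r'((?:[^"<]*"[^"]*")*[^"<]*)<IMG(?: |$)')
# Steps [7] and [9]: text up to and including the first non-enclosed ">".
# Assumption: always used with .match(), i.e. anchored at the start of PS.
IMG_END = re.compile(r'(?:[^">]*"[^"]*")*[^">]*>')


def extract_imgs(text):
    """Hypothetical helper: emulate the sed script on a multi-line string."""
    lines = text.splitlines()
    nxt = 0        # index of the next unread input line
    ps = None      # pattern space; None means "read a fresh line"
    found = []
    while True:
        if ps is None:
            if nxt == len(lines):
                break
            ps = lines[nxt]
            nxt += 1
        start = IMG_START.match(ps)
        if start is None:              # Step [1]: discard the line
            ps = None
            continue
        ps = ps[start.end(1):]         # Step [2]: drop the leading text
        while nxt < len(lines):        # Steps [3]-[8]: not the last line yet
            ps += " " + lines[nxt]     # N, then newline -> space
            nxt += 1
            if IMG_END.match(ps):
                break
        end = IMG_END.match(ps)
        if end is None:                # input ended inside an element
            found.append(ps)           # sed's P would print PS as it is
            ps = None
        else:
            found.append(end.group(0))  # Steps [9]-[10]: emit the element
            ps = ps[end.end():]         # Step [11]: rerun on the remainder
    return found


sample = 'Not this "<IMG src=fake>" but this <IMG src=url_1\nalt="a > b">. Bye!'
for img in extract_imgs(sample):
    print(img)   # prints: <IMG src=url_1 alt="a > b">
```

Like the script, the sketch decides line by line, so a line whose first non-enclosed < starts some other tag is dropped even if an IMG element follows on it.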