Description
We want to extract every IMG element from an HTML file, where:
- There may be more than one IMG element on a line.
- An IMG element may span more than one line.
- Any angle bracket inside a pair of double quotes is treated literally.
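The third rule is the subtle one. As a minimal sketch of one way to honor it on a single line, the snippet below uses an extended regular expression that skips whole quoted strings before looking for a tag. It assumes a sed that accepts `-E`; the helper name `first_unquoted` and the sample line are hypothetical and only serve the illustration.

```shell
# Hypothetical sample: the first IMG sits inside double quotes, the second does not.
line='Not this "<IMG src=fake>" but this <IMG src=real>'

# ([^"<]*"[^"]*")* consumes plain text followed by a complete quoted string,
# as many times as possible; [^"<]* then stops at the first < outside quotes.
# Deleting that prefix leaves the line starting at the first unquoted <.
first_unquoted() {
  printf '%s\n' "$1" | sed -E 's/^([^"<]*"[^"]*")*[^"<]*//'
}

first_unquoted "$line"
# prints: <IMG src=real>
```

A line whose only IMG is quoted is consumed entirely, so the helper prints an empty line for it.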

Raw Input
Not this "<IMG src=test alt=fake>" but this <IMG src=url_1
alt="a normal case...">. The second ...
<IMG src=url_2 alt="test for embedded >
in quotes" width=200 height=200>. Not this "<IMG src=another
alt=cheat> but this <IMG src=url_3 title="
another > test" alt="another test">, the fourth is <IMG
src=url_4 width=200 height=250>. Finally, there are the
fifth <IMG src=url_5 width=250> sixth <IMG src=url_6 width=100
height=200>. Bye!

Desired Output
<IMG src=url_1 alt="a normal case...">
<IMG src=url_2 alt="test for embedded > in quotes" width=200 height=200>
<IMG src=url_3 title=" another > test" alt="another test">
<IMG src=url_4 width=200 height=250>
<IMG src=url_5 width=250>
<IMG src=url_6 width=100 height=200>

Script and Comments
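Script1
# Needs extended regular expressions (sed -E, or -r in GNU sed).
# Delete the line unless it contains an <IMG outside double quotes.
/^([^"<]*"[^"]*")*[^"<]*<IMG( |$)/!d
# Strip everything before that first unquoted <IMG.
s/^([^"<]*"[^"]*")*[^"<]*//
:loop
# Unless on the last line, append the next line (newline becomes a space)
# and loop again while no closing > has been found.
$!{
N
s/\n/ /g
/([^">]*"[^"]*")*[^">]*>/!b loop
}
# Put a newline after the first complete tag, print the tag, then rerun
# the script on whatever follows it.
s/([^">]*"[^"]*")*[^">]*>/&\n/
P
D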
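To try the script, something like the following sketch may help. It assumes GNU sed, where `-E` enables extended regular expressions and `\n` in the replacement of an `s` command inserts a newline; other sed implementations may treat that `\n` differently. The variable `img_script` and the helper `extract_imgs` are hypothetical names, and the script text is the one above without its bracketed line numbers.

```shell
# Script1, passed to sed as a single multi-line -e argument.
img_script='/^([^"<]*"[^"]*")*[^"<]*<IMG( |$)/!d
s/^([^"<]*"[^"]*")*[^"<]*//
:loop
$!{
N
s/\n/ /g
/([^">]*"[^"]*")*[^">]*>/!b loop
}
s/([^">]*"[^"]*")*[^">]*>/&\n/
P
D'

# Hypothetical helper: reads HTML on stdin, writes one IMG element per line.
extract_imgs() {
  sed -E -e "$img_script"
}

# Feed it the raw input from the example.
extract_imgs <<'EOF'
Not this "<IMG src=test alt=fake>" but this <IMG src=url_1
alt="a normal case...">. The second ...
<IMG src=url_2 alt="test for embedded >
in quotes" width=200 height=200>. Not this "<IMG src=another
alt=cheat> but this <IMG src=url_3 title="
another > test" alt="another test">, the fourth is <IMG
src=url_4 width=200 height=250>. Finally, there are the
fifth <IMG src=url_5 width=250> sixth <IMG src=url_6 width=100
height=200>. Bye!
EOF
```

With GNU sed this prints the six lines listed under Desired Output. Passing the script with `-e` avoids a temporary file; saving it to a file and using `sed -E -f` is an equivalent choice.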