Hello,
I am using sanitizer 1.59. I just downloaded 1.63 but have not
installed it yet. I see from the changelog that the XML-style IMG tag
bug that I reported earlier has been addressed.
However, I don't see anywhere in the changelog that the following has
been addressed, so I am reporting it just in case.
The following IMG tag fails to be cleaned properly:
<img style="WIDTH: 259px; HEIGHT: 428px" height="428"
src="http://www3.business-merchandising.com/gfx/couple.jpg"width=259
border=0>
Notice the lack of a space between the src attribute and the width
attribute. Despite this, the tag still renders in Mozilla 1.4. I have
not tested whether the trailing width and border are recognized by the
browser or discarded. Since this tag is not inside an <A></A> pair, the
border part is irrelevant, and the height and width are ignored in favor
of the style part. Because of the missing space after the src
attribute, Anomy fails to defang the src attribute, but it DOES defang
the style attribute. This is the result:
<img DEFANGED_style="WIDTH: 259px; HEIGHT: 428px" height="428"
src="http://www3.business-merchandising.com/gfx/couple.jpg"width=259
border=0>
I am not sure how other mail readers would render this tag, but all
those based on the Gecko engine (Netscape 7, Mozilla, etc.) will let the
spammer sneak this by.
Anomy should probably emulate the behavior of the browser in parsing the
tags. The following rules are based on my experience and past bugs I
have reported.
1) Extract tags based on opening and closing brackets < and >. DO NOT
consider paired quotes inside the tag. The browser does not. That is,
<img src="abc.gif><img src="def.gif> is two tags, even though the inner
>< brackets are contained within quotes and officially this should be
only one big screwed-up tag.
2) Once a tag is isolated, parsing the components should proceed as follows:
First, remove the brackets.
Second, the tag name (img) should terminate with a space or the end of
string.
Third, attributes should be scanned for the following cases:
A) attribute with no value, terminated by a space or end of string:
   <hr NOSHADE> or <hr NOSHADE width="90%">
B) attribute with value WITHOUT quotes, terminated by space, quote or
end of string:
   <hr WIDTH=90%> or <hr WIDTH=90% noshade> or
   <hr WIDTH=90%">
C) attribute with value WITH quotes, terminated by the closing quote or
end of string:
   <hr WIDTH="90%"> or <hr WIDTH="90%" noshade> or
   <hr WIDTH="90%"noshade> or <hr WIDTH="90%>
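To make the two rules above concrete, here is a rough sketch in Python
(not Anomy's own Perl, and not a patch against it). The names split_tags,
parse_tag, TAG_RE and ATTR_RE are made up for illustration. Two things
are my own assumptions rather than part of the rules: single quotes are
treated like double quotes, and tag and attribute names are lowercased.

```python
import re

# Rule 1: a tag runs from "<" to the next ">", ignoring any quotes inside.
TAG_RE = re.compile(r"<[^>]*>")

# Rule 2, cases A/B/C: a name, optionally followed by "=" and a value.
ATTR_RE = re.compile(
    r"""
    ([^\s="']+)                 # attribute name
    (?:
        =
        (?:
            "([^"]*)(?:"|\Z)    # C) double-quoted: ends at closing quote or end of string
          | '([^']*)(?:'|\Z)    # assumption: single quotes behave the same way
          | ([^\s"']*)          # B) unquoted: ends at whitespace, a quote, or end of string
        )
    )?                          # A) no "=" at all: attribute without a value
    """,
    re.VERBOSE,
)


def split_tags(html):
    """Return every <...> span, without regard to quoting (rule 1)."""
    return TAG_RE.findall(html)


def parse_tag(tag):
    """Split one isolated tag into (name, [(attribute, value-or-None), ...])."""
    body = tag[1:-1]                      # first: remove the brackets
    parts = body.split(None, 1)           # second: name ends at whitespace or end
    if not parts:
        return "", []
    name = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    attrs = []
    for match in ATTR_RE.finditer(rest):  # third: scan the attributes
        key = match.group(1).lower()
        # At most one of the three value groups can have matched.
        value = next((g for g in match.groups()[1:] if g is not None), None)
        attrs.append((key, value))
    return name, attrs
```

With this sketch, the IMG tag from the report parses into five separate
attributes (style, height, src, width, border), so the src attribute is
seen on its own even though no space follows its closing quote. A stray
quote such as the one in <hr WIDTH=90%"> is simply skipped.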
If you need help with the regexes or code to accomplish this, let me know.
Paul Wallingford