I have not checked all mail readers, so others may parse HTML in mail
differently. Mozilla has taken the stance that both < and > are special
characters that override most other syntax, including quotes. This is
the problem.
In order for cleaning to work in Anomy, a tag needs to be extracted
based on the opening "<" and closing ">" regardless of the content of
that tag. Spammers and hackers have been exploiting this lately, which
makes me think this HTML interpretation is prevalent in more than just
Mozilla.
The following is the actual tag that fails to get cleaned, and yet
results in the browser pulling images from a remote server.
<img src=
=3D"http://www.discountviagra.com.ar/add/vigad2.jpg" border=3Do">
This translates into:
<img src="http://www.discountviagra.com.ar/add/vigad2.jpg" border=o">
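The decoding step above can be reproduced with a short sketch. Python is used here only for illustration, and the variable names are made up for this example. It relies on the standard quopri module: a trailing "=" on a line is a quoted-printable soft line break, and "=3D" encodes a literal "=".

```python
import quopri

# The raw tag as it appears in the message body (quoted-printable).
# The "=" at the end of the first line is a soft line break, and
# "=3D" stands for a literal "=".
raw = b'<img src=\n=3D"http://www.discountviagra.com.ar/add/vigad2.jpg" border=3Do">'

# Decode to the text the HTML parser actually sees.
decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)
```

The printed line should match the translated tag shown above, with the unbalanced quote still in place before the closing ">".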
Note the unbalanced quotes at the end. The illegal value for the border
(a letter "o" instead of a numeric zero "0") is not the cause of the
problem; the browser simply ignores it.
I also tested the following:
1. <img src="http://www.yahoo.com/image.jpg" border=o">
2. <img src="http://www.yahoo.com/image.jpg" border="0>
3. <img src="http://www.yahoo.com/image.jpg>
I would expect the "src=" to be changed to "DEFANGED_src=". All three
of these escaped cleaning and remained intact.
NOTE: The browser is evidently pulling out the tag *FIRST* based on the
"<" and ">", then parsing the contents. In 1 and 2, the src attribute
is valid and the browser pulls the image from the remote server. The
border attribute is illegal in both and the browser ignores it. In 3,
however, the src attribute itself is illegal (unbalanced quotes), so
the browser ignores it and does not grab the image.
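The behaviour described in this note can be approximated with a toy model. This is only a sketch of what I observed, not Mozilla's actual parser: the ATTR pattern and the accepted_attributes function are hypothetical names invented for the example. The tag boundary is decided first, at the first ">", and only then are attributes checked for balanced quotes.

```python
import re

# A well-formed attribute: name="quoted value" or name=bare, followed by
# whitespace or the end of the tag body. Anything else is skipped.
ATTR = re.compile(r'(\w+)=("[^"]*"|[^\s"]+)(?=\s|$)')

def accepted_attributes(html):
    """Cut the tag at the first ">", then keep only well-formed attributes."""
    end = html.index(">")      # tag boundary is decided before any parsing
    body = html[1:end]         # text between "<" and ">"
    return {name.lower(): value.strip('"') for name, value in ATTR.findall(body)}

tests = [
    '<img src="http://www.yahoo.com/image.jpg" border=o">',
    '<img src="http://www.yahoo.com/image.jpg" border="0>',
    '<img src="http://www.yahoo.com/image.jpg>',
]
for number, tag in enumerate(tests, start=1):
    print(number, accepted_attributes(tag))
```

Under this model, tests 1 and 2 keep a usable src and drop the border, while test 3 keeps nothing, which matches what the browser did.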
So, basically, if a spammer, virus, or hacker creates a tag that has
unbalanced quotes, the browser accepts everything before the unbalanced
quote and ignores everything from there to the end of the tag as
invalid. HTMLCleaner misses this altogether and doesn't clean the tag.
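To illustrate the mismatch, here is a minimal sketch comparing two ways of finding tags. I have not traced HTMLCleaner's code, so QUOTE_AWARE is only a hypothetical stand-in for a cleaner that honours quotes when locating the end of a tag; BRACKET_FIRST mimics the browser behaviour described above. The defang function is likewise an assumption, modelled on the DEFANGED_src= rewrite I expected to see.

```python
import re

# Tag ends at the first ">" no matter what the quotes look like.
BRACKET_FIRST = re.compile(r"<[^>]*>")

# Tag ends at the first ">" that is outside a quoted value
# (hypothetical model of a quote-honouring cleaner).
QUOTE_AWARE = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

def defang(html, tag_pattern):
    """Rewrite src= to DEFANGED_src= inside every tag that tag_pattern finds."""
    def fix(match):
        return re.sub(r"\bsrc=", "DEFANGED_src=", match.group(0),
                      flags=re.IGNORECASE)
    return tag_pattern.sub(fix, html)

tag = '<img src="http://www.yahoo.com/image.jpg" border=o">'
print(defang(tag, QUOTE_AWARE))    # no tag is found, so nothing is cleaned
print(defang(tag, BRACKET_FIRST))  # tag is found and src= is defanged
```

With the quote-aware pattern, the unbalanced quote means no closing ">" is ever recognised and the tag passes through intact. Extracting on "<" and ">" first finds the same tag the browser does.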