Bug parsing and cleaning IMG tag attributes

From: Paul Wallingford
Date: Thu 10 Jul 2003 - 09:15:15 GMT


    I am using sanitizer 1.59. I just downloaded 1.63 but have not
    installed it yet. I see from the changelog that the XML-style IMG tag
    bug that I reported earlier has been addressed.

    However, I don't see anywhere in the changelog that the following has
    been addressed so I am reporting it just in case.

    The following IMG tag fails to be cleaned properly:

    <img style="WIDTH: 259px; HEIGHT: 428px" height="428"

    Notice the lack of a space between the src attribute and the width
    attribute. However, this tag renders on Mozilla 1.4. I have not tested
    it to determine if the trailing width and border are recognized by the
    browser or discarded. Since this tag is not inside an <A></A> pair, the
    border part is irrelevant and the height and width are ignored in favor
    of the style part. Because of the lack of space after the src
    attribute, Anomy fails to defang the src attribute, but it DOES defang
    the style attribute.

    <img DEFANGED_style="WIDTH: 259px; HEIGHT: 428px" height="428"

    I am not sure how other mail readers would render this tag, but all
    those based on the Gecko engine (Netscape 7, Mozilla, etc.) will let the
    spammer sneak this by.

    Anomy should probably emulate the behavior of the browser in parsing the
    tags. The following rules are based on my experience and past bugs I
    have reported.

    1) Extract tags based on opening and closing brackets < and >. DO NOT
    consider paired quotes inside the tag. The browser does not. That is,
    <img src="abc.gif><img src="def.gif> is two tags, even though the inner
    >< brackets are contained within quotes and officially this should be
    only one big screwed-up tag.
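
    As a rough illustration of rule 1, here is a minimal Python sketch. It
    is not Anomy's actual code; the names TAG_RE and extract_tags are made
    up for this example, and it assumes the input is already a single
    string.

```python
import re

# Hypothetical pattern: a tag is everything from "<" to the next ">".
# Quotes are deliberately NOT tracked, mirroring what the browser does.
TAG_RE = re.compile(r"<[^<>]*>")

def extract_tags(html):
    """Return every bracket-delimited tag found in html, in order."""
    return TAG_RE.findall(html)

for tag in extract_tags('<img src="abc.gif><img src="def.gif>'):
    print(tag)
```

    With the example from rule 1, this yields two separate tags rather
    than one tag with a long quoted value.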

    2) Once a tag is isolated, parsing the components should proceed as follows:

    First, remove the brackets.

    Second, the tag name (img) should terminate with a space or the end of
    the string.

    Third, attributes should be scanned for the following cases:

    A) attribute with no value terminated by a space or end of string:
        <hr NOSHADE> or <hr NOSHADE width="90%">

    B) attribute with value WITHOUT quotes, terminated by space, quote or
    end of string:
        <hr WIDTH=90%> or <hr WIDTH=90% noshade> or
        <hr WIDTH=90%">

    C) attribute with value WITH quotes, terminated by the closing quote or
    end of string:
        <hr WIDTH="90%"> or <hr WIDTH="90%" noshade> or
        <hr WIDTH="90%"noshade> or <hr WIDTH="90%>
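
    The rules under 2) could be sketched roughly as follows, again in
    Python and only as an illustration. The function name parse_tag is
    hypothetical, and I am assuming that both single and double quotes
    count as quotes and that a stray quote between attributes is simply
    skipped. XML-style trailing slashes are not handled here.

```python
def parse_tag(tag):
    """Split one isolated tag into (name, [(attribute, value or None), ...])."""
    body = tag.strip()
    # First: remove the brackets.
    if body.startswith("<"):
        body = body[1:]
    if body.endswith(">"):
        body = body[:-1]
    # Second: the tag name ends at the first whitespace or end of string.
    parts = body.split(None, 1)
    if not parts:
        return "", []
    name = parts[0]
    rest = parts[1] if len(parts) > 1 else ""

    attrs = []
    i, n = 0, len(rest)
    while i < n:
        # Skip whitespace and stray quotes between attributes (assumption).
        if rest[i].isspace() or rest[i] in "\"'":
            i += 1
            continue
        # The attribute name runs until whitespace, "=" or end of string.
        start = i
        while i < n and not rest[i].isspace() and rest[i] != "=":
            i += 1
        attr = rest[start:i]
        if i >= n or rest[i] != "=":
            attrs.append((attr, None))       # case A: no value
            continue
        i += 1                               # step over "="
        if i < n and rest[i] in "\"'":
            # Case C: quoted value ends at the closing quote or end of string.
            quote = rest[i]
            i += 1
            end = rest.find(quote, i)
            if end == -1:
                end = n
            attrs.append((attr, rest[i:end]))
            i = end + 1
        else:
            # Case B: unquoted value ends at whitespace, a quote or end of string.
            start = i
            while i < n and not rest[i].isspace() and rest[i] not in "\"'":
                i += 1
            attrs.append((attr, rest[start:i]))
    return name, attrs

print(parse_tag('<hr WIDTH="90%"noshade>'))
```

    Because the scan resumes immediately after a closing quote, an
    attribute glued to the previous one is still seen as a separate
    attribute. For a made-up tag such as <img src="x.gif"width="259">,
    both src and width come out as attributes and can be cleaned.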

    If you need help with the regexes or code to accomplish this, let me know.

    Paul Wallingford
