Parsing error in HTML Cleaner 1.21 (Sanitizer 1.59)

From: Paul Wallingford
Date: Mon 19 May 2003 - 10:22:57 GMT

    I have not checked all mail readers, so there may be some discrepancy in
    how HTML is parsed in mail. Mozilla treats both < and > as special
    characters that take precedence over most other syntax, including
    quotes. This is the problem.

    In order for cleaning to work in Anomy, a tag needs to be extracted
    based on the opening "<" and closing ">" regardless of the content of
    that tag. Spammers and hackers have been exploiting this lately, which
    makes me think this HTML interpretation is prevalent in more than just
    Mozilla.

    The following is the actual tag that fails to get cleaned, and yet
    results in the browser pulling images from a remote server.

    <img src=
    =3D"" border=3Do">

    This translates into:

    <img src="" border=o">
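 
    The translation is ordinary quoted-printable decoding: "=3D" is an
    encoded "=" and a bare "=" at the end of a line is a soft line break.
    As a rough check, here is a small Python sketch using the standard
    quopri module. The raw bytes are copied from the tag as shown above,
    without the indentation of this message; nothing else is assumed.

```python
import quopri

# The tag as it appears in the raw message body (quoted-printable).
# The trailing "=" on the first line is a soft line break.
raw = b'<img src=\n=3D"" border=3Do">'

# Decode quoted-printable: "=\n" is dropped, "=3D" becomes "=".
decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)  # <img src="" border=o">
```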

    Note the unbalanced quotes at the end. The illegal value for the
    border (a letter "o" instead of a numeric zero "0") is not the cause
    of the problem; the browser simply ignores it.

    I also tested the following:

    1. <img src="" border=o">
    2. <img src="" border="0>
    3. <img src=">

    I would expect the "src=" to be changed to "DEFANGED_src=". All three
    of these escaped cleaning and remained intact.

    NOTE: The browser is obviously pulling out the tag *FIRST* based on the
    "<" and ">", then parsing the contents. In the first two, the src
    attribute is valid and the browser pulls the image from the remote
    server. The border attribute is illegal in both 1 and 2 and the browser
    ignores it. However, in #3, the src attribute is illegal (unbalanced
    quotes) so the browser ignores it and does not grab the image.
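 
    The behaviour described in this note can be modelled roughly as
    follows. This is a toy model in Python, based only on what I observed
    and not on Mozilla's actual parser; the names TAG, ATTR and
    lenient_attrs() are invented for the example. It delimits the tag on
    "<" and ">" first, then accepts attributes from left to right and
    stops at the first malformed one.

```python
import re

# Step 1: delimit the tag purely on "<" and ">".
TAG = re.compile(r"<([^<>]*)>")

# Step 2: one well-formed attribute, either name="value" or name=value,
# which must be followed by whitespace or the end of the tag.
ATTR = re.compile(r'''\s+([A-Za-z-]+)=(?:"([^"]*)"|([^\s"'<>]+))(?=\s|$)''')

def lenient_attrs(html):
    """Return the attributes accepted before the first malformed one."""
    tag = TAG.search(html)
    if tag is None:
        return None
    inner = tag.group(1)
    name = re.match(r"[A-Za-z0-9]+", inner)  # element name, e.g. "img"
    pos = name.end() if name else 0
    attrs = {}
    while True:
        attr = ATTR.match(inner, pos)
        if attr is None:
            break  # malformed: ignore the rest of the tag
        quoted, bare = attr.group(2), attr.group(3)
        attrs[attr.group(1).lower()] = quoted if quoted is not None else bare
        pos = attr.end()
    return attrs

print(lenient_attrs('<img src="" border=o">'))   # src accepted, border dropped
print(lenient_attrs('<img src="" border="0>'))   # src accepted, border dropped
print(lenient_attrs('<img src=">'))              # nothing accepted
```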

    So, basically, if a spammer/virus/hacker creates a tag that has
    unbalanced quotes, everything up until then is considered good by the
    browser and everything from then on to the end of the tag is ignored as
    invalid. HTML Cleaner misses this altogether and doesn't clean the tag.
