On Thu, Dec 15, 2005 at 10:16:44AM +1100, Noel Clarkson wrote:
...
> We have not fiddled with which HTML bits get filtered, just using the
> default, and I'm really not that sure what might be a good compromise,
> or if that isn't likely to help that much anyway. If anyone has ideas
> on what settings might be good to overcome this, I'd be really grateful
> to know.
I guess the short-buffer streaming mode of Anomy makes that problematic.
Perhaps it's better done by a file_list_scanner, using either:
1) lynx -force_html -dump %FILENAME %ATTNAME
That's what I'm doing: no HTML at all, please ;)
2) use some HTML cleaner, e.g. htmlclean:
HTMLCLEAN(1)    User Contributed Perl Documentation    HTMLCLEAN(1)

NAME
       htmlclean - a small script to clean up existing HTML

SYNOPSIS
       htmlclean [-v] [-V] file1 [file2 file3 ...]

DESCRIPTION
       This program provides a command-line interface to the
       HTML::Clean module, which can help you to provide more
       ...
In Debian, htmlclean turns up in these packages:
$ dpkg -S htmlclean
wml: /usr/lib/wml/exec/wml_aux_htmlclean
libhtml-clean-perl: /usr/share/man/man1/htmlclean.1p.gz
libhtml-clean-perl: /usr/bin/htmlclean <-- you may want just this!
wp2x: /usr/share/doc/wp2x/filters/htmlcleanup1.pl
wp2x: /usr/share/doc/wp2x/filters/htmlcleanup2.pl
wp2x: /usr/share/doc/wp2x/filters/htmlcleanup3.pl
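
Note that lynx -dump prints the rendered text on stdout, so for option 1 a
small wrapper that writes the result back over the attachment may be handy.
A rough sketch only: the function name and the HTML2TEXT variable are made up
for illustration, and it assumes the scanner is allowed to rewrite the file
in place (check the Anomy docs for how scanner exit codes and modified files
are treated).

```shell
# Hypothetical helper: replace an HTML file in place with its plain-text dump.
# HTML2TEXT is an assumed override; by default the lynx call from option 1 is used.
html_to_text_inplace() {
    h2t_file=$1
    h2t_tmp=$(mktemp) || return 1
    # The converter writes to stdout; collect that in a temp file first,
    # so the original is only touched when the conversion succeeded.
    if ${HTML2TEXT:-lynx -force_html -dump} "$h2t_file" > "$h2t_tmp"; then
        # cat into the existing file keeps its name and permissions.
        cat "$h2t_tmp" > "$h2t_file"
        rm -f "$h2t_tmp"
        return 0
    fi
    rm -f "$h2t_tmp"
    return 1
}
```

A script called from the scanner line would then just run
html_to_text_inplace "$1" and pass the return code on.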
> The second problem was that the log files that are sometimes placed
> inline rather than as an attachment often became 10 or more times longer
I never had that problem: I don't use inline logs at all, so I can't say
anything here.
HTH
-- paolo