HTML Cleaner
How many of you have needed to clean up messy MS Word files in order to integrate them into valid W3C pages, or just to fit them into the overall design?
I’ve looked for a good HTML cleaner and didn’t find a good free one.
Meanwhile, I’ve developed my own HTML Cleaner class in PHP, because I needed to clean up tons of Word-generated code at the time.
I’ve combined the powerful HTML Tidy library with my own regular expression-based cleaning algorithms. I wanted a simple way to strip all unnecessary tags and styles while keeping the output W3C standards-compliant.
Syntax checking is done only when using Tidy.
Note that this tool is designed to strip/clean useless tags and attributes back to HTML basics and optimize code, not sanitize (like HTMLPurifier).
Without the tidy PHP extension, the class can:
- remove styles, attributes
- strip useless tags
- fill empty table cells with non-breaking spaces
- optimize code (merge inline tags, strip empty inline tags, trim excess new lines)
- drop empty paragraphs
- compress (trim space and new-line breaks).
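A few of the actions listed above can be sketched with plain regular expressions. This is only an illustration of the idea, not the code of the class: the function names and patterns are assumptions, the merge step handles attribute-free tags only, and closures-free PHP 5 or later is assumed.

```php
<?php
// Hypothetical helpers: names and patterns are illustrative assumptions,
// not taken from the HTML Cleaner class itself.

// Merge adjacent identical inline tags by deleting "</b><b>" pairs,
// so <b>a</b><b>b</b> becomes <b>ab</b>. Whitespace between the pair is kept.
function mergeInlineTags($html, array $tags)
{
    $alt = implode('|', array_map('preg_quote', $tags));
    do {
        $html = preg_replace('#</(' . $alt . ')>(\s*)<\1>#i', '$2', $html, -1, $count);
    } while ($count > 0); // loop until nothing is left to merge
    return $html;
}

// Strip inline tags that contain nothing at all, repeating for nested cases.
function stripEmptyInlineTags($html, array $tags)
{
    $alt = implode('|', array_map('preg_quote', $tags));
    do {
        $html = preg_replace('#<(' . $alt . ')></\1>#i', '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

// Drop paragraphs that hold only whitespace or &nbsp; entities.
function dropEmptyParagraphs($html)
{
    return preg_replace('#<p(\s[^>]*)?>(\s|&nbsp;)*</p>#i', '', $html);
}

// Put a non-breaking space into td/th cells that are empty.
function fillEmptyTableCells($html)
{
    return preg_replace('#(<(td|th)(\s[^>]*)?>)\s*(</\2>)#i', '$1&nbsp;$4', $html);
}

$inline = array('b', 'i', 'em');
echo mergeInlineTags('<b>Brit</b><b>ish</b> <b>house</b>', $inline), "\n";
```

Looping until no replacement is made matters for nested cases such as an empty tag inside another empty tag, where one pass exposes a new match for the next pass.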
In conjunction with tidy, the class can apply all tidy actions (clean-up, fix errors, convert to XHTML, etc) and then optionally perform all actions of the class (remove styles, compress, etc).
Currently the following cleaning method is implemented: tag whitelist/attribute blacklist
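The whitelist/blacklist idea can be sketched as follows. This is a minimal sketch, not the implementation used by the class: the two function names are hypothetical, the blacklist is assumed to be a regular-expression alternation of attribute names that does not contain the # delimiter, and PHP 5.3 or later is assumed because of the closures.

```php
<?php
// Hypothetical helpers for illustration; not the class's own methods.

// Tag whitelist: strip_tags() removes every tag that is not listed
// in its second argument and keeps the text content.
function applyTagWhitelist($html, $whitelist)
{
    return strip_tags($html, $whitelist);
}

// Attribute blacklist: walk through the attributes of each opening tag
// and drop those whose name matches the blacklist pattern.
function removeBlacklistedAttributes($html, $blacklist)
{
    // One attribute: a name, optionally followed by = and a quoted or bare value.
    $attr = '#\s+([^\s=>/"\']+)(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))?#';
    $banned = '#^(?:' . $blacklist . ')$#i';

    return preg_replace_callback(
        '#<[a-z][a-z0-9]*\s[^>]*>#i', // opening tags that carry attributes
        function ($tag) use ($attr, $banned) {
            return preg_replace_callback(
                $attr,
                function ($m) use ($banned) {
                    return preg_match($banned, $m[1]) ? '' : $m[0];
                },
                $tag[0]
            );
        },
        $html
    );
}

$html = '<div><p id="x" class="a" onclick="go()">Hi <span>there</span></p></div>';
$html = applyTagWhitelist($html, '<p><b><i>');
$html = removeBlacklistedAttributes($html, 'id|on[\w]+');
echo $html, "\n";
```

Matching whole attributes one after another, rather than searching for the blacklisted names anywhere in the tag, avoids touching text that merely looks like an attribute inside a quoted value.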
Properties:
var $Options;
var $Tag_whitelist='<table><tbody><thead><tfoot><tr><th><td><colgroup><col>
<p><br><hr><blockquote>
<b><i><u><sub><sup><strong><em><tt><var>
<code><xmp><cite><pre><abbr><acronym><address><samp>
<fieldset><legend>
<a><img>
<h1><h2><h3><h4><h5><h6>
<ul><ol><li><dl><dt>
<frame><frameset>
<form><input><select><option><optgroup><button><textarea>';
var $Attrib_blacklist='id|on[\w]+';
//array of inline tags that can be merged
var $CleanUpTags=array('a','span','b','i','u','strong','em','big','small','tt',
'var','code','xmp','cite','pre','abbr','acronym','address','q','samp',
'sub','sup');
var $TidyConfig;
var $Encoding='latin1';
$this->Options = array(
'RemoveStyles' => true, //removes style definitions like style and class
'IsWord' => true, //Microsoft Word flag - specific operations may occur
'UseTidy' => true, //also uses the tidy engine to clean up the source (recommended)
'CleaningMethod' => array(TAG_WHITELIST,ATTRIB_BLACKLIST), //cleaning methods
'OutputXHTML' => true, //converts to XHTML by using TIDY.
'FillEmptyTableCells' => true, //fills empty cells with non-breaking spaces
'DropEmptyParas' => true, //drops empty paragraphs
'Optimize' => false, //Optimize code - merge tags
'Compress' => false); //trims all spaces (line breaks, tabs) between tags and between words.
// Specify TIDY configuration
$this->TidyConfig = array(
'indent' => true, /*a bit slow*/
'output-xhtml' => true, //Outputs the data in XHTML format
'word-2000' => false, //when true, removes all proprietary data when an MS Word document has been saved as HTML
//'clean' => true, /*too slow*/
'drop-proprietary-attributes' => true, //Removes all attributes that are not part of a web standard
'hide-comments' => true, //Strips all comments
'preserve-entities' => true, //preserve the well-formed entities as found in the input
'quote-ampersand' => true, //output unadorned & characters as &amp;
'wrap' => 200); //Sets the number of characters allowed before a line is soft-wrapped
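For reference, a configuration array like the one above can be handed directly to PHP's tidy extension. The sketch below is an assumption about how such a call might look, not the class's own code: tidyRepair() is a hypothetical wrapper, and it simply returns the input unchanged when the extension is not loaded.

```php
<?php
// Hypothetical wrapper; only extension_loaded() and tidy_repair_string()
// are real PHP functions here.
function tidyRepair($html, array $config, $encoding = 'latin1')
{
    if (!extension_loaded('tidy')) {
        // No tidy available: hand the markup back untouched.
        return $html;
    }
    // Parses, repairs and serialises the markup according to $config.
    return tidy_repair_string($html, $config, $encoding);
}

$config = array(
    'output-xhtml' => true,
    'word-2000' => false,
    'drop-proprietary-attributes' => true,
    'hide-comments' => true,
    'wrap' => 200,
);

$clean = tidyRepair('<p>Hello<br>world', $config);
echo $clean, "\n";
```

With tidy loaded, the result is a complete XHTML document in which the unclosed tags have been repaired; the exact output depends on the installed tidy version.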
Methods:
function cleanUp($encoding='latin1') //actual cleanup function
See it in action:
http://luci.criosweb.ro/scripts/HTMLCleaner/
Download latest version
Licenced under Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-nc-sa/3.0/)
for personal, non-commercial use
For commercial use, one developer licence costs 15 euros.
Changes:
v.1.0
-taken from RC6
v. 1.0 RC6
-added option to apply tidy before internal cleanup
-added function TidyClean(), which cleans the HTML source with Tidy only, modifying it in place
-changed licence to Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
v. 1.0 RC5
-tidy cleanup works also with PHP 4.3 now. Correction: class is compatible with PHP >=4.3. PHP 5 recommended. Basic cleanup (no tidy) can work with earlier versions of PHP 4
-removed drop-empty-paras option from default tidy config since there is already an internal drop-empty-paras mechanism
-Optimize now defaults to true since it is very useful
-new default tidy config options:
'preserve-entities' => true, //preserve the well-formed entities as found in the input (to display some characters correctly)
'quote-ampersand' => true, //output unadorned & characters as &amp; (as required by W3C)
-default Encoding set to latin1
v. 1.0 RC4
-the class is now compatible with PHP 4.4 or higher (maybe 4.0, but never tested)
-minor bugfix for Optimize (loop until optimized now works correctly)
v. 1.0 RC3
-cleaning is now case-insensitive
-improved optimize, removed EXPERIMENTAL tag
-default tidy config now sets word-2000 to false
Lucian Sabo | Tech stuff (english) | 08 4th, 2007
7 People have left comments on this post
Great job!
1) How about cleaning up:
<b>«</b><b>British</b><b> </b><b>h</b><b>о</b><b>use</b><b>»</b>
2) How about converting e-mail addresses and web links that are not hyperlinked into real hyperlinks? Sample:
*******************
The ABC Ltd.
11, Nevskiy ave, St.Petersburg, Russia
phone +7 911 123-45-87
e-mail abc@abc.ru
*********************
Convert into:
abc@abc.ru
Thanks for your work. Regards
Version 0.9 introduces the tag-merging cleanup suggested by KonstRuctor.
>> 2) How about to convert e-mail and web links without hyperlinks to real hyperlinks?
The source is taken from a WYSIWYG editor that usually already converts links.
Excellent work! Thank you very much.
I had one more thought concerning the given algorithm.
Sometimes it is necessary to clear all unnecessary formatting but still keep the class attribute on text paragraphs.
This is needed above all when laying out long catalogues, where the firm name has the class zagol, the firm address the class address, and the description of the firm's activity the class about. Good luck!
You can disable RemoveStyles completely.
To keep the class attribute and remove only style, modify the cleanUp method at line 122:
from
$this->RemoveBlacklistedAttributes('class|style');
to
$this->RemoveBlacklistedAttributes('style');
Unfortunately there is no support yet for attribute removal on a particular tag. Removal is done globally.
Hi, it works great. Thanks a million. It did take me some time, though, before I figured out that the HTML has to be surrounded by <body> tags.
To make HTMLCleaner parse documents without the body tag, add a condition in HTMLCleaner.php around line 113 like this:
if (stripos($this->html, '<body') !== false) //a match at position 0 counts too
$this->html = stristr($this->html, '<body');
This modification will be included in a future release.
> this doesn't strip the comments msword leaves:
Actually it does strip comments. Please send me by e-mail the HTML code you want to clean up.