HTML-ContentExtractor version 0.02 Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. This module is used to reduce the noise content in web pages and thus identify the content rich regions. A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. By using a depth-first traversal to navigate the DOM tree, noise nodes are identified and removed, thus the main content is extracted. Some useless nodes (script, style, etc.) are removed; the container nodes (table, div, etc.) which have high link/text ratio (higher than threshold) are removed; (link/text ratio is the ratio of the number of links and non-linked words.) The nodes contain any string in the predefined spam string list are removed. INSTALLATION To install this module, run the following commands: perl Makefile.PL make make test make install SUPPORT AND DOCUMENTATION After installing, you can find documentation for this module with the perldoc command. perldoc HTML::ContentExtractor You can also look for information at: Search CPAN http://search.cpan.org/dist/HTML-ContentExtractor CPAN Request Tracker: http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ContentExtractor AnnoCPAN, annotated CPAN documentation: http://annocpan.org/dist/HTML-ContentExtractor CPAN Ratings: http://cpanratings.perl.org/d/HTML-ContentExtractor COPYRIGHT AND LICENCE Copyright (C) 2007 Zhang Jun This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.