HTML Content Extractor

HTML Content Extractor is a special component to clean HTML pages from common noise such as ads, banners, navigations links and menus. It is automatic and does not require any user interaction during the extraction process.

User manual

This extraction can be useful if one wants to retrieve meaningful content from noisy web pages, for instance, news article from some Web news server. The following figures below show an example of an original news article (left) with many navigations menus, ads and extra boxes; and the same page after HTML Content Extractor cleanings (right).

So, what exactly this program does to clean noisy HTML documents and how it detects key content to extract? Saying in a few words, it builds a DOM tree of an HTML page and then applies predefined rules to clean it from unnecessary elements and DOM segments. Sample of these rules is shown in a properties file below. It guides HTML Content Extractor to remove all HTML comments (1), SCRIPT NOSCRIPT INPUT BUTTON LINK STYLE SELECT EMBED OBJECT IMG IFRAMEtags with their content (2); delete all attributes in TD TR TABLE BODY DIV LI UL tags (5). To recognize ads URLs and remove them (3) there is a long list of common advertisement servers (over 22000 ads servers). It is loaded during start up of the extractor and every link is checked against it. Removing text empty elements (4) is a way to detect and delete HTML nodes without any meaningful textual information. For each such a node, a ratio number of linked words to total number of words is calculated. If this ratio is more then defined threshold linkTextRatio then this node will be removed. This approach supposes to delete navigation links like menus, as well as some advertisement links and banners.

// 1. remove html comments  isRemoveComments=true isKeepTags=true keepTags=BR P // <meta ...> tags processing isProcessMeta = true isRemoveAllMeta = false // 2. remove tags isRemoveTags = true removeTags = SCRIPT NOSCRIPT INPUT BUTTON LINK STYLE SELECT EMBED OBJECT IMG IFRAME // 3. remove ads links isRemoveAdLinks = true adsServerListFile = serverlist.txt // 4. remove ‘noisy’ elements isRemoveTextEmptyElements = true isRemoveLinkCells = true substanceMinTextLength = 5 letternsPerWord = 5 linkTextRatio = 0.20 minNumOfWords = 15 // 5. remove attributes in specified nodes isRemoveAttr = true removeAttrNodes = TD TR TABLE BODY DIV LI UL // 6. remove common links (experimental) isRemoveCommonLinks = true

Additional parameter minNumOfWords depicts a minimum number of unlinked words that should be in elements like table or sell. In our case, any table or cell in a table that has less then 15 words will be excluded from the final cleaned document.

There is also a way to perform additional removing of unnecessary URLs – download more HTML pages form the same web server, compare their DOM trees with the cleaned document and remove common elements and patterns. Partially this idea is implemented as a removing common links feature (6). This feature is still experimental, so use it carefully.

After you have tweaked keycontent.properties file according to your needs you can run kce.jar utility program from the kce distribution in the following ways:

Clean an HTML file. Specify a name html_file of an HTML file to clean and a character encoding charset of its content. If charset is omitted then ISO-8859-1 (latin-1) will be used by default. Result of the cleaning will be printed to standard output in UTF-8 encoding:

shell> java -jar kce.jar -clean html_file [charset]
Download and clean an HTML source from a URL. It is the same as the previous one except that source of the HTML page comes from a URL.

shell> java -jar kce.jar -clean url_to_clean [charset]

Start content extractor in a server mode. This will start a TANA server on specified port port_number.

shell> java -jar kce.jar -server -p port_number

Now you can send TANA messages to this server by TanaSend.jar program. Accepted TANA message should have the following key-value pairs:

Required	Key	Value	Description
yes	cmd	filter	Command to apply key content extraction for specified content.
yes	content	string	HTML content to be filtered.
no	charset	string	Character encoding of the content. If it is not specified then ISO-8859-1 will be used by default.
no	url	UTF-8 string	URL in UTF-8 encoding of the content.
no	html	true\|false	Request (do not request) result of the cleaning in HTML format. By default is true.
no	txt	true\|false	Request (do not request) result of the cleaning in textual format. By default is false.

Sample requests to the server:
shell> java -jar TanaSend.jar 12345 cmd:filter content:"some html content to filter"
shell> java -jar TanaSend.jar 12345 cmd:filter content:"some html content to filter" charset:utf-8 html:false txt:true url:http://some.url

Developer Manual

HTML Content Extractor is easy to integrate with your program. Main class that does all processing is Kce. In addition, it can accept any implementations of NodeFoundListenerinterface. This is useful if you, for instance, want to extract all links and titles from HTML pages in advance to cleaning. This code snippet will help you to include HTML Content Extractor library into your Java source code:

01 // allocate a new extractor with default settings 02 KceSettings settings = new KceSettings(); 03 // load settings from "conf/keycontent.properties" file 04 settings.loadSettings("conf/keycontent.properties"); 05 // construct a new extractor 06 Kce extractor = new Kce(settings); 07 // register additional listeners 08 LinkFoundListener linkFoundListener = new LinkFoundListener(); 09 TitleFoundListener titleNodeFoundListener = new TitleFoundListener(); 10 extractor.registerNodeFoundListener(linkFoundListener); 11 extractor.registerNodeFoundListener(titleNodeFoundListener); 12 // perform extraction of key content from a html file "file.html" encoded as ISO-8859-1 13 Document document = extractor.extractKeyContent(new FileInputStream(new File("file.html")), 14 "ISO-8859-1", null); 15 if (document != null) { // cleaning was successful 16 StringWriter stringWriter = new StringWriter(); 17 // present cleaned document as String 18 Kce.prettyPrint(document, "utf-8", stringWriter); 19 System.out.println(stringWriter); 20 }

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.