HTML2XML

HTML2XML is a component that extracts data from HTML templates and represents it in XML format. The transformation is driven by special scripts that define which parts of the HTML DOM tree should be extracted and how.

To see an example of such a transformation, consider the following HTML page:

Suppose we are only interested in the news item on this page, marked by the red box. Moreover, within this content we would like to recognize elements such as the title, date, author, and news body, if they exist. The HTML2XML transformer can do this, and the resulting XML file for our page looks like this:

<?xml version="1.0" encoding="utf-8"?>
<DOCUMENT>
<article>
<content>AOL is struggling and it looks like Google might do for them, what Overture did for Yahoo. Some interesting facts in this article including:

Paid search is becoming a force to be reckoned with and is expected to reach $5 billion in revenue by 2008. (explains Yahoo's acquisition of Overture).

Google will add $28 million to AOL's revenue this year, rising to $76 million by 2007.

AOL is reportedly considering placing Google's Adwords throughout the AOL network. What's stopping them? They need a way to take advantage of the reducing subscriber numbers before they lose them all.
Can Google save America Online? | CNET News.com</content>
<links>
<link>http://rss.com.com/2100-1024_3-053461.html?type=pt&part=rss&tag=feed&subj=news</link>
</links>
<date>July 24, 2003</date>
<iso-date>2003-07-24</iso-date>
</article>
</DOCUMENT>

You may notice a few things here: content holds the news article as text with a layout close to the original HTML view; links found in this content are extracted and presented in a separate links section; and the publication date is also parsed into ISO-8601 format as iso-date.

To perform such a transformation, the HTML2XML module needs a special script for each HTML template. The script is a set of commands for navigating the DOM tree, with a simple syntax and a simple underlying idea. For instance, the script that parses our previous HTML page into an XML document is the following:

N	command	parameters	[true][false]
01	FIND_NODE	... div class="Post" ...	[next][END_OF_DOCUMENT]
02	SAVE_POS	$content_start	[next][END_OF_DOCUMENT]
03	FIND_NODE	... span class="PostTitle"	[next][GOTO_TASK 7]
04	STORE_TEXT	title -1 -1	[next][END_OF_DOCUMENT]
05	FIND_NODE	... /span class="PostTitle"	[next][GOTO_TASK 7]
06	SAVE_POS	$content_start	[next][END_OF_DOCUMENT]
07	FIND_NODE	... span class="PostFooter"	[next][END_OF_DOCUMENT]
08	SAVE_POS	$content_end	[next][END_OF_DOCUMENT]
09	STORE_TEXT	content $content_start $content_end	[next][END_OF_DOCUMENT]
10	STORE_LINKS	$content_start $content_end	[next][END_OF_DOCUMENT]
11	FIND_NODE	div class="DateHeader" ...	[next][END_OF_DOCUMENT]
12	SAVE_TEXT	$date -1 -1	[next][END_OF_DOCUMENT]
13	STORE_NODE	date $date	[next][END_OF_DOCUMENT]
14	STORE_ISODATE	MMMMM dd',' yyyy $date	[next][END_OF_DOCUMENT]
15	SET_POS	$content_end	[next][END_OF_DOCUMENT]
16	END_OF_ARTICLE		[next][END_OF_DOCUMENT]
17	GOTO_TASK	1	[][]

This script has 17 commands, each of which performs one specific task. For example, (01) finds the position in the DOM tree that matches the given template; (02) saves that position in the variable $content_start; (04) stores the text from the last found position to the end of the node as title; (06) saves the position in $content_start again; (09) stores the text within the range of saved positions as content; (15) moves the current DOM position to a previously saved place; and (16) marks the end of an article. Other commands mark the end of processing, and so on. Scripts also have simple if-else-style flow control. The last column of each row gives two actions: the first is taken if the task succeeds and the second if it fails. For instance, if task (01) succeeds, the next task is performed ([next]); otherwise, parsing of the document ends ([END_OF_DOCUMENT]).

Here is a list of the available HTML2XML commands with their descriptions:
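The success/failure branching described above can be sketched as a small interpreter loop. This is only an illustrative model, not HTML2XML's own code: the class ScriptFlowSketch, the Step type, and the NEXT and END_OF_DOCUMENT constants are hypothetical names, and each command is reduced to a function that reports whether it succeeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical model of the [true][false] flow control; not the HTML2XML classes. */
class ScriptFlowSketch {

    /** Branch codes; any positive value is the number of the task to jump to. */
    static final int NEXT = 0;
    static final int END_OF_DOCUMENT = -1;

    /** One script row: an action plus the branch taken on success and on failure. */
    static final class Step {
        final String name;
        final BooleanSupplier action;
        final int onTrue;
        final int onFalse;

        Step(String name, BooleanSupplier action, int onTrue, int onFalse) {
            this.name = name;
            this.action = action;
            this.onTrue = onTrue;
            this.onFalse = onFalse;
        }
    }

    /** Runs the steps (numbered from 1) and returns their names in execution order. */
    static List<String> run(List<Step> steps, int maxSteps) {
        List<String> trace = new ArrayList<>();
        int current = 1;
        // maxSteps guards against scripts that loop forever via jumps.
        while (current >= 1 && current <= steps.size() && trace.size() < maxSteps) {
            Step step = steps.get(current - 1);
            trace.add(step.name);
            int branch = step.action.getAsBoolean() ? step.onTrue : step.onFalse;
            if (branch == END_OF_DOCUMENT) {
                break;
            }
            current = (branch == NEXT) ? current + 1 : branch;
        }
        return trace;
    }

    public static void main(String[] args) {
        // Loosely mirrors rows 03-07: when the title is not found, its storing step is skipped.
        List<Step> steps = List.of(
                new Step("FIND_NODE post", () -> true, NEXT, END_OF_DOCUMENT),
                new Step("FIND_NODE title", () -> false, NEXT, 4),
                new Step("STORE_TEXT title", () -> true, NEXT, END_OF_DOCUMENT),
                new Step("FIND_NODE footer", () -> true, NEXT, END_OF_DOCUMENT));
        System.out.println(run(steps, 100));
    }
}
```

With the failing title lookup above, the trace contains the post, title, and footer lookups but not the title-storing step, which is the same skip that [GOTO_TASK 7] produces in row 03 of the script.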

Task	Description
FIND_NODE	Search for a DOM node that matches the given template and move the current position to it if it is found. If no such node is found, the current position is left unchanged.
STORE_TEXT	Label and store the text within the specified range of DOM positions.
STORE_NODE	Store the defined text.
STORE_ISO_DATE	Parse a dateline into ISO-8601 format and store it.
SAVE_POS	Save the current position into a variable.
SET_POS	Set the current position in the DOM tree.
GOTO_TASK	Jump to another task.
INVOKE_METHOD	Invoke a Java method.
END_OF_ARTICLE	Mark the end of an article.
END_OF_DOCUMENT	Mark the end of the parsing process.
EMPTY_TASK	Do nothing.
SAVE_TEXT	Label the text within the specified range of DOM positions and save it into a variable.
STORE_LINKS	Store the links found within the given DOM positions.
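To illustrate the date conversion, here is a minimal sketch that turns the dateline from the example into an ISO-8601 date. It assumes that the pattern MMMMM dd',' yyyy used in row 14 of the script follows java.text.SimpleDateFormat conventions, which its quoting style suggests but the text above does not state. The class IsoDateSketch and the method toIsoDate are hypothetical names, not part of HTML2XML.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper; assumes SimpleDateFormat-style patterns. */
class IsoDateSketch {

    /** Parses a dateline with the given pattern and returns it as yyyy-MM-dd. */
    static String toIsoDate(String pattern, String dateline) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        // English month names are assumed because the example dateline is in English.
        SimpleDateFormat input = new SimpleDateFormat(pattern, Locale.ENGLISH);
        input.setTimeZone(utc);
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
        iso.setTimeZone(utc);
        try {
            Date parsed = input.parse(dateline);
            return iso.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Dateline does not match pattern: " + dateline, e);
        }
    }

    public static void main(String[] args) {
        // Same pattern and dateline as in the example script and XML output.
        System.out.println(toIsoDate("MMMMM dd',' yyyy", "July 24, 2003")); // prints 2003-07-24
    }
}
```

Both formatters use the same time zone so that the calendar date cannot shift between parsing and formatting.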

The Html2Xml distribution package contains an html2xml.jar program that can perform HTML2XML processing in two modes:

Batch mode. Each HTML file in the directory specified by -dir should have a corresponding meta file. See the test directory in the distributed package for examples of such files. The following command processes all files in the test/searchengineland.com directory:
shell> java -jar html2xml.jar -cmd batch -dir test/searchengineland.com
To process directories recursively, use the -R switch:
shell> java -jar html2xml.jar -cmd batch -dir test/ -R
Server mode. Start a server with the following command, replacing port_number with the port to listen on:
shell> java -jar html2xml.jar -cmd server -p port_number

Then send it TANA messages with the following key-value pairs:

Required	Key	Value	Description
yes	content	string	HTML source to parse.
yes	script	string	A script to parse given content.
yes	url	string	UTF-8 encoded URL of the HTML source.
no	charset	string	Character encoding of the HTML content. If not specified, the encoding is detected from the content.
no	xml-charset	string	Desired encoding of the resulting XML. Defaults to UTF-8.
no	meta	string	Content-type headers from the HTTP response for the content.
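Before sending a message, a client may want to check the key-value pairs against the table above. The sketch below does only that: it verifies the three required keys and fills in the documented UTF-8 default for xml-charset. It does not build or send a TANA message, since the wire format is not described here, and the class MessageFieldsSketch and the method prepare are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side check of the message fields; no TANA encoding is done here. */
class MessageFieldsSketch {

    /** Keys marked as required in the table above. */
    static final List<String> REQUIRED = Arrays.asList("content", "script", "url");

    /** Returns a copy of the fields with defaults applied, or throws if a required key is missing. */
    static Map<String, String> prepare(Map<String, String> fields) {
        for (String key : REQUIRED) {
            String value = fields.get(key);
            if (value == null || value.isEmpty()) {
                throw new IllegalArgumentException("Missing required key: " + key);
            }
        }
        Map<String, String> prepared = new LinkedHashMap<>(fields);
        // xml-charset is optional and documented to default to UTF-8.
        prepared.putIfAbsent("xml-charset", "UTF-8");
        return prepared;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", "<html><body>...</body></html>");
        fields.put("script", "...");
        fields.put("url", "http://www.searchenginelowdown.com/");
        System.out.println(prepare(fields).get("xml-charset"));
    }
}
```

The optional charset and meta keys are passed through unchanged, because the server is described as detecting the content encoding itself when charset is absent.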

Developer manual

Now let us look at this component from a developer’s point of view.

To use the HTML2XML library in your own code, follow this template:

// 1. Allocate necessary objects
DefaultScriptRunner scriptRunnerImpl = new DefaultScriptRunner();
NewsXMLResultFormatter xmlResultFormatter = new NewsXMLResultFormatter();
// html2xmlPropFile - path to html2xml.properties file
// scriptsDir - directory of scripts location
ScriptLoader scriptLoader = new ScriptLoader(html2xmlPropFile, scriptsDir);

// 2. get list of available scripts for given uri
List<String> scriptList = scriptLoader.getScripts(uri);
if (scriptList == null || scriptList.size() == 0)
    return null;

String xml = null; // resulted xml
for (String script : scriptList) { // try each script to run
    try {
        // 3. htmlSource - string of html content to parse
        List<Task> tasks = scriptRunnerImpl.performScript(htmlSource, uri,
                new StringScriptParser(script));
        // 4. format result as xml
        // xmlEncoding - desired encoding of XML file
        xml = xmlResultFormatter.format(tasks, xmlEncoding, scriptRunnerImpl);
    } catch (IOException e) {
    }
    if (xml != null) // 5. break processing if current script succeeded
        break;
}

The DefaultScriptRunner class is a particular implementation of the HTML2XML transformer that performs tasks. NewsXMLResultFormatter formats the processed tasks as XML. ScriptLoader loads scripts from files. These three classes are the main units needed to get the HTML2XML component working in your application. Allocate their instances first (1). Note that the ScriptLoader constructor requires two parameters: a properties file and the directory where the scripts are located. The properties file follows the Java Properties format, with a server name as the key and the corresponding script filenames as the value:

www.researchbuzz.com = researchbuzz.script
www.researchbuzz.net = researchbuzz.script;researchbuzz.script~2
www.searchenginelowdown.com = www.searchenginelowdown.com.script
searchenginewatch.com = searchenginewatch.com.script
google.blognewschannel.com = google.blognewschannel.script
searchengineherald.com = searchengineherald.com.script

This properties file assigns scripts to a particular server. For instance, any page from the www.searchenginelowdown.com host will be processed by the script www.searchenginelowdown.com.script, which should be located in the scriptsDir directory. If there is at least one script for the site (2), it is run against the HTML content using DefaultScriptRunner (3). The performed tasks are then formatted as XML by NewsXMLResultFormatter (4). Finally, do not forget to stop processing further scripts once you are satisfied with the current result (5).
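The host-to-script lookup can be illustrated with the standard java.util.Properties class. This is a sketch of the idea only and is not the actual ScriptLoader: the class ScriptLookupSketch and its methods are hypothetical, and treating the semicolon as the separator between script filenames is an assumption based on the www.researchbuzz.net entry above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/** Hypothetical illustration of the host-to-scripts mapping; not the real ScriptLoader. */
class ScriptLookupSketch {

    /** Parses properties text such as the sample mapping shown above. */
    static Properties parse(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Returns the script filenames mapped to the host of the given URL, in file order. */
    static List<String> scriptsFor(Properties mapping, String url) {
        String host = URI.create(url).getHost();
        String value = (host == null) ? null : mapping.getProperty(host);
        if (value == null) {
            return Collections.emptyList();
        }
        List<String> names = new ArrayList<>();
        // Assumption: several scripts for one host are separated by semicolons.
        for (String name : value.split(";")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Properties mapping = parse(
                "www.researchbuzz.net = researchbuzz.script;researchbuzz.script~2\n"
                + "searchenginewatch.com = searchenginewatch.com.script\n");
        System.out.println(scriptsFor(mapping, "http://www.researchbuzz.net/news/index.html"));
    }
}
```

Returning the filenames in file order matters when a host has several scripts, because the template tries them one by one and stops at the first that produces XML.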

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.