HTML2XML

HTML2XML is a component that extracts data from HTML templates and represents it in XML format. The transformation is driven by scripts that define which parts of the HTML DOM tree should be extracted and how.

To see an example of such a transformation, consider the following HTML page:

Suppose we are interested only in the news item on this page, marked by the red box. Within this content we would also like to recognize elements such as the title, date, author and news body, if they exist. The HTML2XML transformer can do this, and the resulting XML file for our page looks like this:

<?xml version="1.0" encoding="utf-8"?>
<DOCUMENT>
<article>
<content>AOL is struggling and it looks like Google might do for them, what Overture did for Yahoo. Some interesting facts in this article including:

Paid search is becoming a force to be reckoned with and is expected to reach $5 billion in revenue by 2008. (explains Yahoo's acquisition of Overture).

Google will add $28 million to AOL's revenue this year, rising to $76 million by 2007.

AOL is reportedly considering placing Google's Adwords throughout the AOL network. What's stopping them? They need a way to take advantage of the reducing subscriber numbers before they lose them all.
Can Google save America Online? | CNET News.com</content>
<links>
<link>http://rss.com.com/2100-1024_3-053461.html?type=pt&amp;part=rss&amp;tag=feed&amp;subj=news</link>
</links>
<date>July 24, 2003</date>
<iso-date>2003-07-24</iso-date>
</article>
</DOCUMENT>

A few things are worth noting here: content holds the news article as text with a layout close to the original HTML view; links found in this content are extracted and presented in a separate links section; and the publication date is also parsed into ISO-8601 format as iso-date.
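The date conversion can be illustrated with a minimal Java sketch. It assumes that the date pattern follows java.text.SimpleDateFormat conventions and that the dateline is in English; the source does not say which parser HTML2XML uses internally. The class and method names are hypothetical, and the pattern is the one that appears in the example script for this page.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the HTML2XML API.
final class IsoDateSketch {

    // Parses a dateline with the given pattern and re-formats it as an ISO-8601 date.
    static String toIsoDate(String dateline, String pattern) {
        try {
            Date parsed = new SimpleDateFormat(pattern, Locale.ENGLISH).parse(dateline);
            return new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH).format(parsed);
        } catch (ParseException e) {
            // The dateline does not match the pattern.
            throw new IllegalArgumentException("Cannot parse date: " + dateline, e);
        }
    }

    public static void main(String[] args) {
        // prints "2003-07-24"
        System.out.println(toIsoDate("July 24, 2003", "MMMMM dd',' yyyy"));
    }
}
```

Note that java.time.format.DateTimeFormatter interprets five pattern letters M as the narrow month form, so the same pattern would not parse "July" there.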

To perform such a transformation, the HTML2XML module needs a script for each HTML template. The script is a set of commands for navigating the DOM tree, with a simple syntax and a simple idea behind it. For instance, the script that parses the HTML page above into an XML document is the following:

N   command         parameters                           true false
--  --------------  -----------------------------------  -------------------------
01  FIND_NODE       ... div class="Post" ...             [next][END_OF_DOCUMENT]
02  SAVE_POS        $content_start                       [next][END_OF_DOCUMENT]
03  FIND_NODE       ... span class="PostTitle"           [next][GOTO_TASK 7]
04  STORE_TEXT      title -1 -1                          [next][END_OF_DOCUMENT]
05  FIND_NODE       ... /span class="PostTitle"          [next][GOTO_TASK 7]
06  SAVE_POS        $content_start                       [next][END_OF_DOCUMENT]
07  FIND_NODE       ... span class="PostFooter"          [next][END_OF_DOCUMENT]
08  SAVE_POS        $content_end                         [next][END_OF_DOCUMENT]
09  STORE_TEXT      content $content_start $content_end  [next][END_OF_DOCUMENT]
10  STORE_LINKS     $content_start $content_end          [next][END_OF_DOCUMENT]
11  FIND_NODE       div class="DateHeader" ...           [next][END_OF_DOCUMENT]
12  SAVE_TEXT       $date -1 -1                          [next][END_OF_DOCUMENT]
13  STORE_NODE      date $date                           [next][END_OF_DOCUMENT]
14  STORE_ISODATE   MMMMM dd',' yyyy $date               [next][END_OF_DOCUMENT]
15  SET_POS         $content_end                         [next][END_OF_DOCUMENT]
16  END_OF_ARTICLE                                       [next][END_OF_DOCUMENT]
17  GOTO_TASK       1                                    [][]

This script has 17 commands. Each command performs a specific task, such as finding a position in the DOM tree that matches a given template (01), saving the last found position in the variable $content_start (02), storing text as title from the last found DOM position to the end of the node (04), saving the position in $content_start again (06), storing text as content within the range of saved positions (09), moving the current DOM position to a previously saved place (15), or marking the end of an article (16) or the end of processing. Such scripts also have simple if-else-style flow control. The last column shows what happens after each task: if, for instance, task (01) succeeds, then the next task is performed ([next]); otherwise, parsing of the document is finished ([END_OF_DOCUMENT]).
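The branching rule can be sketched in Java as follows. This is only an illustration of how a cell such as [next][GOTO_TASK 7] might be interpreted, not the actual HTML2XML implementation: the class name, the method name and the END constant are hypothetical, and the empty [][] cell of task 17 is deliberately not modeled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the if-else flow control; not the real HTML2XML code.
final class BranchSketch {

    // Marker returned when parsing of the document should stop.
    static final int END = -1;

    // Matches a cell of the form [on-success][on-failure].
    private static final Pattern CELL = Pattern.compile("\\[([^\\]]*)\\]\\[([^\\]]*)\\]");

    // Returns the number of the task to run after task 'current',
    // given whether that task succeeded, or END to stop parsing.
    static int nextTask(int current, boolean succeeded, String cell) {
        Matcher m = CELL.matcher(cell.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed branch cell: " + cell);
        }
        String action = (succeeded ? m.group(1) : m.group(2)).trim();
        if (action.equals("next")) {
            return current + 1;
        }
        if (action.equals("END_OF_DOCUMENT")) {
            return END;
        }
        if (action.startsWith("GOTO_TASK")) {
            return Integer.parseInt(action.substring("GOTO_TASK".length()).trim());
        }
        throw new IllegalArgumentException("Unsupported action: " + action);
    }

    public static void main(String[] args) {
        // Task 03 fails to find the title node, so the script jumps to task 07.
        System.out.println(nextTask(3, false, "[next][GOTO_TASK 7]")); // prints "7"
    }
}
```

With this reading, a page without a PostTitle span skips tasks 04 to 06 and continues at task 07, which would explain why the example XML output contains no title element.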

Here is a list of the available HTML2XML commands with their descriptions:

Task             Description
---------------  -----------
FIND_NODE        Search for a DOM node matching the given template and move the current position to it if it is found. If the DOM node is not found, the current position stays unmodified.
STORE_TEXT       Label and store text within the specified range of DOM positions.
STORE_NODE       Store the defined text.
STORE_ISO_DATE   Parse a dateline into ISO-8601 format and store it.
SAVE_POS         Save the current position into a variable.
SET_POS          Set the current position in the DOM tree.
GOTO_TASK        Jump to another task.
INVOKE_METHOD    Invoke a Java method.
END_OF_ARTICLE   Label the end of an article.
END_OF_DOCUMENT  Label the end of the parsing process.
EMPTY_TASK       Do nothing.
SAVE_TEXT        Label and store text within the specified range of DOM positions into a variable.
STORE_LINKS      Store links inside the given DOM positions.

The Html2Xml distribution package contains the html2xml.jar program, which can perform HTML2XML processing in two ways:

  1. Batch mode. Each HTML file in the directory specified by -dir should have a corresponding meta file. See the test directory in the distributed package for an example of such files. The following command processes all files in the test/searchengineland.com directory:
    shell> java -jar html2xml.jar -cmd batch -dir test/searchengineland.com
    To process directories recursively, use the -R switch:
    shell> java -jar html2xml.jar -cmd batch -dir test/ -R

  2. Server mode. Start a server with the following command (replace port_number with an actual port number):
    shell> java -jar html2xml.jar -cmd server -p port_number

    Then send TANA messages to it with the following key-value pairs:

    Required  Key          Value   Description
    --------  -----------  ------  -----------
    yes       content      string  HTML source to parse.
    yes       script       string  A script to parse the given content.
    yes       url          string  UTF-8 encoded URL of the HTML source.
    no        charset      string  Character encoding of the HTML content. If not specified, the encoding of the content will be detected.
    no        xml-charset  string  Desired encoding of the resulting XML files. The default is UTF-8.
    no        meta         string  Content-type headers from the HTTP response of the content.
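As a rough illustration of these rules, the following Java sketch checks a set of key-value pairs before they are sent. It models only the keys listed above; the TANA wire format itself is not described here, so nothing about it is assumed. The class and method names are hypothetical, and filling in xml-charset on the client side is a design choice of this sketch that simply mirrors the documented default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side check; not part of the HTML2XML distribution.
final class MessageSketch {

    // Keys marked as required in the table above.
    private static final String[] REQUIRED = {"content", "script", "url"};

    // Verifies that all required keys are present and makes the
    // documented xml-charset default explicit when it is missing.
    static Map<String, String> prepare(Map<String, String> pairs) {
        for (String key : REQUIRED) {
            if (pairs.get(key) == null) {
                throw new IllegalArgumentException("Missing required key: " + key);
            }
        }
        Map<String, String> result = new LinkedHashMap<>(pairs);
        result.putIfAbsent("xml-charset", "UTF-8");
        return result;
    }
}
```

The optional charset and meta keys are passed through unchanged when present.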

Developer manual

Now let us look at this component from a developer’s point of view.

To use the HTML2XML library in your own code, follow this template:

    // Requires java.util.List and java.io.IOException in addition to the HTML2XML classes
    // 1. Allocate the necessary objects
    DefaultScriptRunner scriptRunnerImpl = new DefaultScriptRunner();
    NewsXMLResultFormatter xmlResultFormatter = new NewsXMLResultFormatter();
    // html2xmlPropFile - path to the html2xml.properties file
    // scriptsDir - directory where the scripts are located
    ScriptLoader scriptLoader = new ScriptLoader(html2xmlPropFile, scriptsDir);
    // 2. Get the list of available scripts for the given uri
    List<String> scriptList = scriptLoader.getScripts(uri);
    if (scriptList == null || scriptList.isEmpty())
      return null;
    String xml = null; // resulting xml
    for (String script : scriptList) { // try to run each script
      try {
        // 3. htmlSource - string of html content to parse
        List<Task> tasks = scriptRunnerImpl.performScript(htmlSource, uri,
            new StringScriptParser(script));
        // xmlEncoding - desired encoding of the XML file
        xml = xmlResultFormatter.format(tasks, xmlEncoding, scriptRunnerImpl); // 4. format the result as xml
      } catch (IOException e) {
        // this script failed; xml stays null and the next script is tried
      }
      if (xml != null) // 5. stop once the current script has succeeded
        break;
    }

The DefaultScriptRunner class is a particular implementation of the HTML2XML transformer that performs tasks. NewsXMLResultFormatter formats the processed tasks as XML. ScriptLoader loads scripts from files. These three classes are the main units needed to get the HTML2XML component working in your application. Allocate their instances first (1). Note that the ScriptLoader constructor requires two parameters: a properties file and the directory where the scripts are located. The properties file is a file compliant with the Java Properties class, with a server name as the key and the corresponding script filenames as the value:

www.researchbuzz.com = researchbuzz.script
www.researchbuzz.net = researchbuzz.script;researchbuzz.script~2
www.searchenginelowdown.com = www.searchenginelowdown.com.script
searchenginewatch.com = searchenginewatch.com.script
google.blognewschannel.com = google.blognewschannel.script
searchengineherald.com = searchengineherald.com.script

This properties file assigns scripts to a particular server. For instance, any page from the www.searchenginelowdown.com host will be processed by the script www.searchenginelowdown.com.script, which should be located in the scriptsDir directory. If there is at least one script for the site, it is run against the HTML content using DefaultScriptRunner (3). The performed tasks are then formatted as XML by NewsXMLResultFormatter (4). Finally, do not forget to stop processing the remaining scripts once you are satisfied with the current result (5).
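The host-to-script mapping can be illustrated with a small Java sketch built on java.util.Properties. It is not the ScriptLoader implementation, whose internals are not shown here: the class and method names are hypothetical, the page URL in the usage example is made up, and the reading of ";" as a separator between several script filenames is inferred from the www.researchbuzz.net entry above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical lookup for illustration only; not the real ScriptLoader.
final class ScriptLookupSketch {

    // Returns the script filenames registered for the host of pageUri,
    // or an empty list if the host has no entry.
    static List<String> scriptsFor(String propertiesText, String pageUri) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read properties", e);
        }
        String host = URI.create(pageUri).getHost();
        String value = (host == null) ? null : props.getProperty(host);
        List<String> scripts = new ArrayList<>();
        if (value != null) {
            // Assumption: several filenames are separated by ';'.
            for (String name : value.split(";")) {
                if (!name.trim().isEmpty()) {
                    scripts.add(name.trim());
                }
            }
        }
        return scripts;
    }

    public static void main(String[] args) {
        String text = "www.researchbuzz.net = researchbuzz.script;researchbuzz.script~2\n";
        // prints "[researchbuzz.script, researchbuzz.script~2]"
        System.out.println(scriptsFor(text, "http://www.researchbuzz.net/index.html"));
    }
}
```

Returning the filenames in file order matches the template above, where each script is tried in turn until one of them produces a result.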

License

Copyright (c) 2005 by Vladimir Poroshin. All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.