HTML2XML is a component that extracts data from HTML templates and represents it in XML format. The transformation is driven by scripts that define which parts of the HTML DOM tree should be extracted and how.
To see an example of such a transformation, consider the following HTML page:
Suppose we are only interested in the news item on this page, outlined by the red box. Moreover, within this content it is useful to recognize such things as the title, date, author, and news body, if they exist. The HTML2XML transformer can do this, and the resulting XML file for our page will look like:
<?xml version="1.0" encoding="utf-8"?>
You may notice a few things here: content represents the news article as text with a layout close to the original HTML view; links from this content are extracted and presented in a separate section, links; the publication date is also parsed into ISO-8601 format as iso-date.
To perform such a transformation, the HTML2XML module needs a script for each HTML template. The script is a set of commands for navigating the DOM tree, with a simple syntax. For instance, the script that parses our previous HTML page into an XML document is the following:
N  | command        | parameters                           | true false
01 | FIND_NODE      | ... div class="Post" ...             | [next][END_OF_DOCUMENT]
02 | SAVE_POS       | $content_start                       | [next][END_OF_DOCUMENT]
03 | FIND_NODE      | ... span class="PostTitle"           | [next][GOTO_TASK 7]
04 | STORE_TEXT     | title -1 -1                          | [next][END_OF_DOCUMENT]
05 | FIND_NODE      | ... /span class="PostTitle"          | [next][GOTO_TASK 7]
06 | SAVE_POS       | $content_start                       | [next][END_OF_DOCUMENT]
07 | FIND_NODE      | ... span class="PostFooter"          | [next][END_OF_DOCUMENT]
08 | SAVE_POS       | $content_end                         | [next][END_OF_DOCUMENT]
09 | STORE_TEXT     | content $content_start $content_end  | [next][END_OF_DOCUMENT]
10 | STORE_LINKS    | $content_start $content_end          | [next][END_OF_DOCUMENT]
11 | FIND_NODE      | div class="DateHeader" ...           | [next][END_OF_DOCUMENT]
12 | SAVE_TEXT      | $date -1 -1                          | [next][END_OF_DOCUMENT]
13 | STORE_NODE     | date $date                           | [next][END_OF_DOCUMENT]
14 | STORE_ISODATE  | MMMMM dd',' yyyy $date               | [next][END_OF_DOCUMENT]
15 | SET_POS        | $content_end                         | [next][END_OF_DOCUMENT]
16 | END_OF_ARTICLE |                                      | [next][END_OF_DOCUMENT]
17 | GOTO_TASK      | 1                                    | [][]
This script has 17 commands. Each command performs a specific task, such as finding a position in the DOM tree that matches a given template (01), saving the last found position in the variable $content_start (02), storing a node as title from the last found DOM position to the end of the node (04), saving the position in $content_start again (06), storing text as content within the range of saved positions (09), moving the current DOM position to a predefined place (15), marking the end of an article (16), or ending the processing. Scripts also have simple if-else-style flow control. The last column shows which task runs next: if, for instance, task (01) succeeds, the next task is performed ([next]); otherwise, parsing of this document finishes ([END_OF_DOCUMENT]).
Here is a list of the available HTML2XML commands with their descriptions:
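The if-else flow described above can be pictured as a small interpreter loop. The following Java sketch is not the HTML2XML implementation: the FlowControlSketch class, its Task type, the run method, and the integer encoding of [next] and [END_OF_DOCUMENT] are all hypothetical names chosen for illustration. It only shows how a pair of success and failure targets can decide which task runs next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of [true][false] branching; not the HTML2XML API.
class FlowControlSketch {
    static final int NEXT = -1;            // "[next]": continue with the following task
    static final int END_OF_DOCUMENT = -2; // "[END_OF_DOCUMENT]": stop processing

    // One script row: a label, an action that reports success, and two jump targets.
    static final class Task {
        final String label;
        final BooleanSupplier action;
        final int onTrue;
        final int onFalse;

        Task(String label, BooleanSupplier action, int onTrue, int onFalse) {
            this.label = label;
            this.action = action;
            this.onTrue = onTrue;
            this.onFalse = onFalse;
        }
    }

    // Runs the script from task 1 and returns the labels of the executed tasks in order.
    // maxSteps guards against endless loops such as a trailing "GOTO_TASK 1".
    static List<String> run(List<Task> script, int maxSteps) {
        List<String> trace = new ArrayList<>();
        int current = 1; // task numbers are 1-based, as in the script table
        for (int step = 0; step < maxSteps && current >= 1 && current <= script.size(); step++) {
            Task task = script.get(current - 1);
            trace.add(task.label);
            int target = task.action.getAsBoolean() ? task.onTrue : task.onFalse;
            if (target == END_OF_DOCUMENT) {
                break;
            }
            current = (target == NEXT) ? current + 1 : target;
        }
        return trace;
    }

    // A four-task demo: the title lookup fails, so task 3 is skipped.
    static List<Task> demoScript() {
        return Arrays.asList(
            new Task("find post", () -> true, NEXT, END_OF_DOCUMENT),
            new Task("find title", () -> false, NEXT, 4), // on failure, jump to task 4
            new Task("store title", () -> true, NEXT, END_OF_DOCUMENT),
            new Task("find footer", () -> true, END_OF_DOCUMENT, END_OF_DOCUMENT));
    }

    public static void main(String[] args) {
        System.out.println(run(demoScript(), 100)); // prints [find post, find title, find footer]
    }
}
```

The failed title lookup in the demo mirrors tasks 03 and 05 of the script, where a missing title skips ahead instead of ending the document.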
Task            | Description
FIND_NODE       | Search a DOM node against given template and move current position to it if it is found. If the DOM node is not found then current position stays unmodified.
STORE_TEXT      | Label and store text within specified range of DOM positions.
STORE_NODE      | Store defined text.
STORE_ISO_DATE  | Parse a dateline into ISO-8601 format and store it.
SAVE_POS        | Save current position into a variable.
SET_POS         | Set current position in a DOM tree.
GOTO_TASK       | Jump to other task.
INVOKE_METHOD   | Invoke some Java method.
END_OF_ARTICLE  | Label the end of an article.
END_OF_DOCUMENT | Label the end of the parsing process.
EMPTY_TASK      | Do nothing.
SAVE_TEXT       | Label and store text within specified range of DOM positions into some variable.
STORE_LINKS     | Store links inside given DOM positions.
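The ISO-8601 date conversion described above can be approximated in plain Java. This is a minimal sketch, assuming that the date pattern in the example script (MMMMM dd',' yyyy) follows java.text.SimpleDateFormat syntax; the source does not state which date library HTML2XML uses. The IsoDateSketch class, the toIsoDate method, and the sample dateline are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper; assumes SimpleDateFormat-style patterns.
class IsoDateSketch {
    // Parses a dateline with the given pattern and returns it as yyyy-MM-dd.
    static String toIsoDate(String pattern, String dateline) {
        SimpleDateFormat input = new SimpleDateFormat(pattern, Locale.ENGLISH);
        input.setLenient(false); // reject impossible dates such as "March 45"
        try {
            Date parsed = input.parse(dateline);
            // Both formatters use the default time zone, so the calendar day is preserved.
            return new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable dateline: " + dateline, e);
        }
    }

    public static void main(String[] args) {
        // The pattern is the one used in the example script; the date is made up.
        System.out.println(toIsoDate("MMMMM dd',' yyyy", "March 05, 2005")); // prints 2005-03-05
    }
}
```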
The Html2Xml distribution package contains html2xml.jar, which can perform HTML2XML processing in two ways:
Batch mode. Each HTML file in the directory specified by -dir should have a corresponding meta file. Check the test directory in the distributed package for an example of such files. The following command processes all files in the test/searchengineland.com directory:
shell> java -jar html2xml.jar -cmd batch -dir test/searchengineland.com
To process directories recursively, use the -R switch:
shell> java -jar html2xml.jar -cmd batch -dir test/ -R
Server mode. Start a server with the following command (replace port_number with an actual port number):
shell> java -jar html2xml.jar -cmd server -p port_number
Then send TANA messages to it with the following key-value pairs:
Required | Key         | Value  | Description
yes      | content     | string | HTML source to parse.
yes      | script      | string | A script to parse the given content.
yes      | url         | string | UTF-8 encoded URL of the HTML source.
no       | charset     | string | Character encoding of the HTML content. If not specified, the encoding of the content is detected.
no       | xml-charset | string | Desired encoding of the resulting XML files. Defaults to UTF-8.
no       | meta        | string | Content-type headers from the HTTP response of the content.
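Before sending a message, a client may want to check that the required keys are present. The sketch below only assembles the key-value pairs listed above and reports missing required ones. It does not implement the TANA wire format, which is not described here, and the MessageFieldsSketch class and its missingRequired method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side check; does not encode or send TANA messages.
class MessageFieldsSketch {
    // Keys marked "yes" in the Required column.
    static final List<String> REQUIRED_KEYS = Arrays.asList("content", "script", "url");

    // Returns the required keys that are absent or empty, in table order.
    static List<String> missingRequired(Map<String, String> fields) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = fields.get(key);
            if (value == null || value.isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", "<html><body><div class=\"Post\">Hello</div></body></html>");
        fields.put("url", "http://www.example.com/news.html"); // made-up URL
        fields.put("xml-charset", "UTF-8");                    // optional key
        System.out.println(missingRequired(fields));           // prints [script]
    }
}
```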
Now let us look at this component from a developer’s point of view.
To use the HTML2XML library in your own code, follow this template:
The DefaultScriptRunner class is a particular implementation of the HTML2XML transformer that performs tasks. NewsXMLResultFormatter formats the processed tasks as XML. ScriptLoader loads scripts from files. These three classes are the main units needed to get the HTML2XML component working in your application. Allocate their instances first (1).
Note that the ScriptLoader constructor requires two parameters: a properties file and the directory where the scripts are located. The properties file is a file compliant with the Java Properties class, with a server name as the key and the corresponding script filenames as the value:
www.researchbuzz.com = researchbuzz.script
This properties file assigns a script to a particular server. For instance, any page from the www.searchenginelowdown.com host will be processed by the script www.searchenginelowdown.com.script, which should be located in the scriptsDir directory. If there is at least one script for this site, it is run against the HTML content using DefaultScriptRunner (3). The performed tasks are then formatted to XML by NewsXMLResultFormatter (4). In addition, do not forget to stop processing the remaining scripts once you are satisfied with the current result (5).
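The host-to-script mapping described above can be illustrated with the standard java.util.Properties class. This sketch does not use ScriptLoader, whose API is not shown here; the ScriptLookupSketch class, its parse and scriptFor methods, and the sample page URL are hypothetical. It only demonstrates how a properties entry keyed by server name can be resolved from a page URL.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Hypothetical lookup helper; ScriptLoader itself may work differently.
class ScriptLookupSketch {
    // Parses properties text such as "www.researchbuzz.com = researchbuzz.script".
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the script filename registered for the URL's host, or null if none.
    static String scriptFor(Properties props, String pageUrl) {
        String host = URI.create(pageUrl).getHost();
        return host == null ? null : props.getProperty(host);
    }

    public static void main(String[] args) {
        Properties props = parse("www.researchbuzz.com = researchbuzz.script\n");
        String name = scriptFor(props, "http://www.researchbuzz.com/news/item.html");
        // Resolve the filename against the scripts directory.
        System.out.println(new File("scriptsDir", name).getPath());
    }
}
```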
Copyright (c) 2005 by Vladimir Poroshin. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.