WebNews Crawler

WebNews Crawler is a Java application for crawling (downloading, fetching) resources via HTTP. You can use it as a generic crawler to download web pages from the Internet. It has a set of filters to limit and focus the crawling process. In addition, WebNews Crawler comes with a powerful HTML2XML library that can extract desired data from HTML pages and represent it in XML format. Together with its ability to parse RSS feeds, this makes the crawler useful for acquiring and cleaning web news articles.

This software is used as part of the ALVIS search engines:
http://wikipedia.hiit.fi/searchenginenews/front
http://wikipedia.hiit.fi/searchenginenews/background.html
Read more about the ALVIS project here:
http://www.alvis.info/alvis/

Installation

WebNews Crawler is a pure Java application. To run it, you need a Java Runtime Environment (JRE), version 1.5.x or higher, installed and configured. The software is platform independent and has been tested on Windows and Linux. The package itself requires no special installation: just download the WebNews Crawler archive and unpack it to a location on your hard drive.
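
The Java requirement can be checked before the first run. The sketch below is one possible way to do it. The function names 'installed_java_version' and 'java_version_ok' are illustrative and not part of the package, and the sed pattern assumes that 'java -version' prints a line of the usual form: java version "1.8.0_292".

```shell
# Print the version string reported by the installed 'java', e.g. 1.8.0_292.
# Assumes 'java -version' writes a line like: java version "1.8.0_292"
# (the JVM prints this on stderr, hence the 2>&1).
installed_java_version() {
    java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if a version string such as 1.5.0_22 or 11.0.2 is 1.5 or newer.
java_version_ok() {
    major=${1%%[!0-9]*}      # leading digits, e.g. "1" or "11"
    rest=${1#*.}             # everything after the first dot
    minor=${rest%%[!0-9]*}   # leading digits of the remainder
    [ -n "$major" ] || return 2   # not a recognisable version string
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 5 ]; }
}
```

A possible use is: java_version_ok "$(installed_java_version)" && echo "Java is recent enough"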

User Manual

1. Prepare a list of URLs to crawl.
Fill the 'bin/news-rss.crl' file with your tasks.
2. Modify the configuration files according to your needs.
Open and edit the 'bin/conf/crawler.properties' file. It contains all the important parameters of the crawling process.
3. (a) Start the crawler in a console ('shell>' stands for your command prompt).
Change to the 'bin/' directory of the package:
shell> cd unpacked_package_dir/bin
Start the crawler process and wait until it finishes:
shell> java -jar webnews-crawler.jar -cmd start

(b) If you plan to crawl a lot of URLs, consider using the crawler's server mode instead of step 3(a). To do this, start the crawler server (replace 12345 with a port number of your choice):
shell> java -jar webnews-crawler.jar -cmd server -p 12345
Use TanaSend.jar to send a 'start' command to the server (Read more about communication protocol):
shell> java -jar TanaSend.jar localhost:12345 cmd:start
Now you can control and monitor the crawling process by sending the commands listed below:

cmd:	Description
start	Start a crawling process.
pause	Pause a crawling process.
resume	Resume a crawling process after it has been paused.
stop	Stop a crawling process. Cannot be resumed.
shutdown	Shut down the crawler.
exit	Interrupt a crawling process.
stat	Get statistics from the crawler.
log	Change log4j properties file and reload it. Requires a parameter 'file'.
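
The other commands in the table are presumably sent the same way as 'start'. As a rough sketch, the hypothetical helper below ('tanasend_cmdline' is not part of the package) prints the TanaSend command line for a given server address and command, reusing only the syntax shown above. It rejects unknown command names, and it deliberately does not handle 'log', because how the 'file' parameter is passed is not described here.

```shell
# Print (but do not run) the TanaSend command line for a crawler server command.
# Usage: tanasend_cmdline HOST:PORT COMMAND
tanasend_cmdline() {
    case "$2" in
        start|pause|resume|stop|shutdown|exit|stat)
            printf 'java -jar TanaSend.jar %s cmd:%s\n' "$1" "$2"
            ;;
        log)
            # 'log' needs a 'file' parameter; its syntax is not documented here.
            echo "log requires a 'file' parameter; not handled by this sketch" >&2
            return 2
            ;;
        *)
            echo "unknown crawler command: $2" >&2
            return 1
            ;;
    esac
}
```

Printing the line first lets you inspect it before running it. From the directory that contains TanaSend.jar, the printed line can then be executed directly, for example with: $(tanasend_cmdline localhost:12345 pause)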

In step 3 the crawler has downloaded some content and stored it inside its internal database. Here is a way to export this database:
shell> java -jar webnews-crawler.jar -cmd export
When the export finishes, you will see a directory 'export-<timestamp>', where <timestamp> is a Unix timestamp. In this directory each exported resource has one 'meta' file and one 'original' file, both named with the same number. In addition, if HTML2XML processing succeeded for a resource, there will also be an 'xml' file for it.
The 'meta' file contains meta information about the corresponding resource, such as its URL, the HTML title if there is one, the detected encoding, the time stamp of the crawl, and the HTTP headers prefixed by 'Meta-'.
The 'original' file holds the content of the fetched resource exactly as it came from the socket, i.e. without any conversion or modification.
The 'xml' file (if any) is the result of HTML2XML processing.
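
To see at a glance which exported resources went through HTML2XML successfully, the export directory can be scanned with a few lines of shell. The sketch below is only an illustration: the function name 'summarize_export' is made up, and it ASSUMES the files are named '<number>.meta', '<number>.original' and '<number>.xml'. Check the actual file names in your export directory and adjust the patterns if they differ.

```shell
# List each exported resource and whether an 'xml' file exists for it.
# ASSUMPTION: files are named <number>.meta, <number>.original, <number>.xml.
# Usage: summarize_export EXPORT_DIR
summarize_export() {
    for meta in "$1"/*.meta; do
        [ -e "$meta" ] || continue        # no .meta files: pattern did not match
        id=$(basename "$meta" .meta)      # the resource number
        if [ -e "$1/$id.xml" ]; then
            echo "$id xml"
        else
            echo "$id no-xml"
        fi
    done
}
```

Called as summarize_export export-<timestamp>, it prints one line per resource, which can be piped to 'grep no-xml' to find pages where HTML2XML produced no output.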

There is also a GUI version of the crawler. It does not support all features of the console version, but it can still be useful for simple crawling. Start the crawler with the GUI this way:
shell> java -jar webnews-crawler.jar -cmd gui

Integration with ALVIS tools

You can convert all exported resources into the ALVIS XML format. To do so, download and install the ALVIS support tools. They can also be installed from CPAN:
shell> sudo perl -MCPAN -e "install Alvis::Convert"
Use the news_xml2alvis script to convert the directory exported by the crawler into a set of ALVIS files.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.