WebNews Crawler is a Java application for crawling (downloading, fetching) resources via HTTP. You can use it as a generic crawler to download web pages from the Internet. It has a set of filters to limit and focus your crawling process. In addition, WebNews Crawler comes with a powerful HTML2XML library that can extract desired data from HTML pages and represent it in XML format. Combined with its ability to parse RSS feeds, this crawler is useful for acquiring and cleaning web news articles.
This software is used as part of the ALVIS search engines:
http://wikipedia.hiit.fi/searchenginenews/front
http://wikipedia.hiit.fi/searchenginenews/background.html
Read more about the ALVIS project here:
http://www.alvis.info/alvis/
WebNews Crawler is a pure Java application. You need a Java Runtime Environment (JRE) version 1.5.x or higher installed and configured to run it. The software is platform independent and has been tested on Windows and Linux boxes. No special installation of the package itself is required: just download and unpack the WebNews Crawler archive into some location on your hard drive.
Prepare a list of URLs to crawl: fill the 'bin/news-rss.crl' file with your tasks.
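The syntax of 'news-rss.crl' is not documented in this README; purely as an illustration (one task URL per line is an assumption, and the URLs below are placeholders), a task list might look like:

```
# Hypothetical task list -- the real 'news-rss.crl' syntax may differ.
http://example.com/news/rss.xml
http://example.org/feeds/world.rss
```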
Modify the configuration files according to your needs. Open and edit the 'bin/conf/crawler.properties' file; it contains all the important parameters of the crawling process.
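The actual parameter names are documented inside 'bin/conf/crawler.properties' itself; the keys below are illustrative placeholders only, shown in standard Java properties syntax:

```
# Illustrative sketch only -- these key names are hypothetical;
# consult bin/conf/crawler.properties for the real parameters.
crawler.threads=4
crawler.depth=2
crawler.timeout=30000
```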
(a) Start the crawler in a console ('shell>' below is your command prompt). Change to the 'bin/' directory of the package:
shell> cd unpacked_package_dir/bin
Start the crawler process and wait until it finishes:
shell> java -jar webnews-crawler.jar -cmd start
(b) If you plan to crawl a lot of URLs, consider using the crawler's server mode instead of step 3(a). To do this, start the crawler server (change 12345 to a port number of your choice):
shell> java -jar webnews-crawler.jar -cmd server -p 12345
Then use TanaSend.jar to send a 'start' command to the server (see the communication protocol documentation for details):
shell> java -jar TanaSend.jar localhost:12345 cmd:start
Now you can control and monitor the crawling process by sending the commands listed below:

cmd:     | Description
---------+------------------------------------------------------------
start    | Start a crawling process.
pause    | Pause a crawling process.
resume   | Resume a crawling process after it has been paused.
stop     | Stop a crawling process. Cannot be resumed.
shutdown | Shut down the crawler.
exit     | Interrupt a crawling process.
stat     | Get statistics from the crawler.
log      | Change the log4j properties file and reload it. Requires a
         | parameter 'file'.
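TanaSend.jar's wire format is not documented in this README; a minimal sketch of what such a client might do, assuming a plain TCP connection with newline-terminated 'cmd:...' messages and a single-line reply (both assumptions):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Sketch of a TanaSend-style control client. The message format
// ("cmd:..." terminated by a newline, one-line reply) is an
// assumption, not the crawler's documented protocol.
public class CrawlerControl {
    public static String send(String host, int port, String message) throws Exception {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println(message);   // e.g. "cmd:stat"
            return in.readLine();   // read the server's one-line reply
        }
    }

    public static void main(String[] args) throws Exception {
        // Against a running crawler server this would look like:
        // System.out.println(send("localhost", 12345, "cmd:stat"));
    }
}
```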
In step 3 the crawler downloaded some content and stored it in its internal database. Export this database as follows:
shell> java -jar webnews-crawler.jar -cmd export
When the export finishes you will see a directory 'export-<timestamp>', where <timestamp> is a Unix time. In this directory each exported resource has one 'meta' and one 'original' file sharing the same number as a name. In addition, if HTML2XML processing was successful for the resource, there will be an additional 'xml' file.
A 'meta' file contains meta information about the corresponding resource, such as its URL, the HTML title (if there is one), the detected encoding, a time stamp of the crawl, and the HTTP response headers prefixed with 'Meta-'.
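The exact layout of a 'meta' file is not shown in this README. Based on the description above, a hypothetical example (the field names and their order are guesses; only the 'Meta-' prefix on HTTP headers is stated) could be:

```
URL: http://example.com/news/article-17.html
Title: Example article title
Encoding: ISO-8859-1
Timestamp: 1130837400
Meta-Content-Type: text/html; charset=ISO-8859-1
Meta-Server: Apache/2.0
```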
An 'original' file is the content of the fetched resource exactly as it came from the socket, i.e. without any conversion or modification.
An 'xml' file (if present) is the result of HTML2XML processing.
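A post-processing script typically needs to pair these files back up. The sketch below groups an export directory's files by resource number; the naming convention '<number>.meta' / '<number>.original' / '<number>.xml' is an assumption, since the README only says the files share a number as a name:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Sketch: group exported files by resource number. The
// "<number>.<kind>" naming scheme is assumed, not documented.
public class ExportWalker {
    public static Map<String, TreeSet<String>> group(File exportDir) {
        Map<String, TreeSet<String>> byResource = new HashMap<>();
        File[] files = exportDir.listFiles();
        if (files == null) return byResource;
        for (File f : files) {
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            if (dot <= 0) continue;                  // skip unexpected names
            String number = name.substring(0, dot);  // resource id
            String kind = name.substring(dot + 1);   // meta / original / xml
            byResource.computeIfAbsent(number, k -> new TreeSet<>()).add(kind);
        }
        return byResource;
    }
}
```

With this grouping it is easy to spot resources whose HTML2XML step failed: they have 'meta' and 'original' entries but no 'xml'.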
There is also a GUI version of the crawler. It does not support some features of the console version but can still be useful for simple crawling. Start the crawler with the GUI this way:
shell> java -jar webnews-crawler.jar -cmd gui
All exported resources can be converted into the ALVIS XML format. To do so, download and install the ALVIS support tools; they can also be installed from CPAN:
shell> sudo perl -MCPAN -e "install Alvis::Convert"
Then use the news_xml2alvis script to process the directory exported from the crawler into a set of ALVIS files.
Copyright (c) 2005 by Vladimir Poroshin. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.