LameBOT home
LameBOT is a distributed internet crawling system
designed for scalability across a network
and efficient performance. It uses the excellent libcurl, a free multi-protocol
file transfer library (cURL homepage), and uriparser, a strictly RFC
3986 compliant cross-platform URI parsing library (released under the New
BSD License).
The LameBOT system currently consists of four
component applications: SCentralURLPool, SCentralGuard,
SCentralIndexer and SPeripheralSpider.
As the name implies, SCentralURLPool
implements a queue-based server for pushing and popping the URLs to be explored
by the SPeripheralSpider application. The design philosophy is that a single
SCentralURLPool instance serves several SPeripheralSpider instances, most probably
running on several computers connected by a fast network
(a 100 Mbps LAN, for instance).
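
The push/pop behaviour described above can be pictured with a small in-process sketch. This is not LameBOT's code: the class name UrlPool and its methods are hypothetical, and the networking layer that lets several spiders share one pool is left out entirely. It only shows the first-in, first-out queue idea, assuming URLs are handed out in the order they arrived.

```python
from collections import deque
from typing import Optional


class UrlPool:
    """Hypothetical in-memory FIFO pool of URLs awaiting exploration."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def push(self, url: str) -> None:
        # A spider that discovers a new link hands it to the pool.
        self._queue.append(url)

    def pop(self) -> Optional[str]:
        # A spider asking for work gets the oldest pending URL,
        # or None when nothing is waiting.
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)
```

In this sketch an empty pool answers with None, so a spider can decide for itself whether to wait and retry or to stop.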
The SCentralGuard application is designed on a
one-to-many server architecture similar to SCentralURLPool, and it prevents
over-representation of URLs from any one site in the collection of explored
URLs.
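
One way to picture this kind of guard is a per-host counter with a cap. The sketch below is an assumption about how such a check could work, not a description of SCentralGuard itself: the class name SiteGuard, the max_per_site parameter and the choice of the host name as the definition of a "site" are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


class SiteGuard:
    """Hypothetical guard that caps how many URLs one host may contribute."""

    def __init__(self, max_per_site: int) -> None:
        self.max_per_site = max_per_site
        self._counts: Counter = Counter()

    def admit(self, url: str) -> bool:
        # Assumption: a "site" is identified by the URL's host name.
        host = urlsplit(url).hostname or ""
        if self._counts[host] >= self.max_per_site:
            return False  # this host is already fully represented
        self._counts[host] += 1
        return True
```

A spider would call admit() before exploring a URL and skip the URL when the answer is False.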
The SCentralIndexer application ensures that SPeripheralSpider instances do
not explore URLs that have already been explored, thereby reducing redundancy. This
application implements a many-to-many server architecture and can form a ring
of servers.
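
The page does not say how the ring of indexer servers divides its work, so the following is only a sketch under a stated assumption: each URL is assigned to one node by hashing it, and that node alone remembers whether the URL has been seen. The class name IndexerRing and its methods are hypothetical, and plain in-memory sets stand in for the servers.

```python
import hashlib


class IndexerRing:
    """Hypothetical set of indexer nodes, each owning a share of the URLs."""

    def __init__(self, node_count: int) -> None:
        # One "seen" set per node; real nodes would be separate servers.
        self._nodes = [set() for _ in range(node_count)]

    def node_for(self, url: str) -> int:
        # Assumption: a stable hash of the URL picks the owning node.
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._nodes)

    def first_visit(self, url: str) -> bool:
        # True the first time a URL is reported, False on every repeat.
        seen = self._nodes[self.node_for(url)]
        if url in seen:
            return False
        seen.add(url)
        return True
```

Because the same URL always hashes to the same node, any spider can ask any entry point and still get a consistent answer, which is the property a duplicate check needs.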
What the system currently lacks is a database system for
holding the contents of the explored URLs, a BOT exclusion processing system,
a type filter for followed URLs, and support for exploring large numbers of URLs
(by using on-disk memory-mapped files, for example). The number of URLs that
can be explored is currently limited by the amount of RAM present
on the system: running one instance of all four applications together on one machine
with 512 MB of RAM, one can explore around a million pages. There
is also no support for exploring relative URLs; only direct, absolute links are
followed. Moreover, although the system is designed for a networked deployment, it provides
little support for network pipe implementations apart from a NULL
DACL.
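
To make the relative URL limitation concrete, the sketch below shows what resolving relative links against the page they were found on involves. It is not part of LameBOT and uses Python's standard urllib.parse rather than uriparser; the function name resolve_links is hypothetical, and keeping only http and https results is an assumption about which links a crawler would want.

```python
from urllib.parse import urljoin, urlsplit


def resolve_links(page_url, hrefs):
    """Turn the links found on page_url into absolute http(s) URLs."""
    resolved = []
    for href in hrefs:
        # urljoin applies RFC 3986 reference resolution; links that are
        # already absolute pass through unchanged.
        absolute = urljoin(page_url, href)
        # Assumption: only web links are worth queueing for exploration.
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved
```

A link such as "../img/logo.png" found on http://example.com/docs/index.html would be queued as http://example.com/img/logo.png, while a mailto: link would be dropped.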