LameBOT home

LameBOT is a distributed internet crawling system designed to scale across a network and to perform efficiently. It uses the excellent libcurl, a free multi-protocol file transfer library (cURL homepage), and uriparser, a strictly RFC 3986 compliant cross-platform URI parsing library (released under the New BSD License).
The LameBOT system currently consists of four component applications: SCentralURLPool, SCentralGuard, SCentralIndexer and SPeripheralSpider.
As the name implies, SCentralURLPool implements a queue-based server for pushing and popping URLs to be explored by the SPeripheralSpider application. The design philosophy is that a single SCentralURLPool instance serves several SPeripheralSpider instances, most probably running on several computers connected by a fast network (a 100 Mbps LAN, for instance).
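The push/pop behaviour described above can be sketched as a small in-memory queue. This is only an illustration in Python, not LameBOT's actual code: the class name URLPool and its methods are hypothetical, the network transport is left out, and the lock stands in for whatever synchronization the real server uses to serve several spiders at once.

```python
import threading
from collections import deque
from typing import Optional


class URLPool:
    """Hypothetical stand-in for the queue inside SCentralURLPool."""

    def __init__(self) -> None:
        self._queue = deque()
        # Assumption: several spiders may push and pop concurrently.
        self._lock = threading.Lock()

    def push(self, url: str) -> None:
        """Add a URL to the back of the queue."""
        with self._lock:
            self._queue.append(url)

    def pop(self) -> Optional[str]:
        """Remove and return the oldest URL, or None if the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._queue)
```

In this sketch a spider would call pop() to obtain its next URL and push() for every link it discovers on the fetched page, so URLs are explored in the order they were found.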
The SCentralGuard application uses a one-to-many server architecture similar to that of SCentralURLPool, and it prevents the URLs of any single site from being over-represented in the collection of explored URLs.
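One simple way to limit how often a single site appears is a per-host counter with a cap. The sketch below assumes that policy; the page does not say how SCentralGuard actually decides, so the SiteGuard class, the admit method and the max_per_site parameter are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


class SiteGuard:
    """Hypothetical per-site cap; not SCentralGuard's actual policy."""

    def __init__(self, max_per_site: int) -> None:
        self.max_per_site = max_per_site
        self._counts = Counter()  # host name -> number of URLs admitted so far

    def admit(self, url: str) -> bool:
        """Return True and count the URL if its site is still under the cap."""
        host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
        if not host or self._counts[host] >= self.max_per_site:
            return False
        self._counts[host] += 1
        return True
```

A spider would ask the guard before exploring each URL and skip any URL for which admit() returns False.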
The SCentralIndexer application ensures that SPeripheralSpider instances do not re-explore URLs that have already been explored, thereby reducing redundancy. This application implements a many-to-many server architecture, and its servers can form a ring.
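The duplicate check can be illustrated with a set of already-seen URLs split across several nodes. How the real ring of SCentralIndexer servers divides the work is not described here, so the hash-based choice of node below is purely an assumption, and the names IndexerNode, IndexerRing and check_and_mark are hypothetical.

```python
import hashlib


class IndexerNode:
    """Hypothetical single indexer holding the URLs it has seen."""

    def __init__(self) -> None:
        self._seen = set()

    def check_and_mark(self, url: str) -> bool:
        """Return True if the URL is new (and record it), False otherwise."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


class IndexerRing:
    """Assumed scheme: a stable hash of the URL picks the responsible node."""

    def __init__(self, node_count: int) -> None:
        self._nodes = [IndexerNode() for _ in range(node_count)]

    def _node_for(self, url: str) -> IndexerNode:
        # SHA-1 is used only as a stable hash, not for security.
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._nodes)
        return self._nodes[index]

    def check_and_mark(self, url: str) -> bool:
        return self._node_for(url).check_and_mark(url)
```

Because the same URL always hashes to the same node in this sketch, any spider can ask any entry point and still get a consistent answer, and each node holds only part of the full set.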
What the system currently lacks is a database for holding the contents of the explored URLs, a BOT exclusion processing system, a type filter for followed URLs, and support for exploring large numbers of URLs (by using on-disk memory-mapped files, for example). The number of URLs that can be explored is currently limited by the amount of RAM on the system: running one instance of each of the four applications together on one machine with 512 MB of RAM, one can explore around a million pages. There is also no support for relative URLs; only direct, absolute links are followed. Moreover, although the system is designed for networked deployment, it provides little support for network pipe implementations apart from a NULL DACL.
