Web content harvesting, which covers both retrieving and parsing content, is generally a precursor to analysis tasks such as search, ad placement, and relevance ranking. As part of this project I developed a distributed content harvester that uses thread pools to retrieve and parse content.

The distributed harvester supports duplicate task elimination, task handoffs between distributed harvesters, and configurable thresholds for sizing the thread pools and limiting recursion depth during crawling. It can also detect disjoint sub-graphs and broken links within a particular web domain.
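To make a few of these features concrete, here is a minimal single-process sketch in Python. It is not the project's actual Crawler or Thread Pool class: the `crawl` function, the `Fetcher` type, and the `max_depth` and `pool_size` parameters are hypothetical names chosen for illustration, and the fetcher is passed in so the example runs without network access. It shows duplicate elimination with a seen-set, a configurable pool size and depth limit, and broken-link collection. Task handoffs between harvesters are not covered.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical fetcher type: returns a page's outgoing links,
# or None when the URL cannot be retrieved (a broken link).
Fetcher = Callable[[str], Optional[List[str]]]


def crawl(seed: str, fetch: Fetcher, max_depth: int = 2,
          pool_size: int = 4) -> Tuple[Set[str], Set[str]]:
    """Crawl level by level from seed; return (fetched URLs, broken URLs)."""
    seen = {seed}                 # every URL ever queued; prevents duplicate tasks
    fetched: Set[str] = set()     # URLs retrieved successfully
    broken: Set[str] = set()      # URLs whose fetch returned None
    frontier = [seed]             # URLs to fetch at the current depth

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _depth in range(max_depth + 1):
            if not frontier:
                break
            # Fetch the whole level concurrently; map() preserves input order.
            results = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, links in zip(frontier, results):
                if links is None:
                    broken.add(url)
                    continue
                fetched.add(url)
                for link in links:
                    if link not in seen:   # duplicate task elimination
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return fetched, broken


if __name__ == "__main__":
    # Toy link graph standing in for a web domain (made-up data).
    site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["missing"], "d": ["e"]}
    fetched, broken = crawl("a", site.get, max_depth=2)
    print(sorted(fetched), sorted(broken))
```

Because the seen-set is only updated in the calling thread after each level completes, no locking is needed in this sketch. Given a separate list of a domain's known pages, any page absent from the fetched set is unreachable from the seed, which is one simple way to surface disjoint sub-graphs.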

Crawler class

Thread Pool class

eemove - Capistrano-esque automated deployment for ExpressionEngine

Automates the task of pushing and pulling website resources and databases across various environments.

Distributed Hash Table

Published on February 24, 2014