Nutch 2.X Tutorial

Introduction: This document describes how to configure Nutch 2.X to use HBase as the storage backend for Gora. It assumes a working knowledge of configuring Nutch 1.X, because configuration in 2.X is currently more complex. Take this into consideration before progressing any further. We therefore strongly…

Storm-crawler–the scalable spider

Storm-crawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is released under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java. The aim of Storm-crawler is to help build web crawlers that are: scalable, resilient, low latency, easy to extend…

What is Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases. Nutch 1.x: a well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing. Nutch 2.x:…

Larbin: Multi-purpose web crawler

Introduction: Larbin is a web crawler (also called a (web) robot, spider, scooter…). It is intended to fetch a large number of web pages to fill the database of a search engine. With a fast enough network, Larbin should be able to fetch more than 100 million pages on a standard PC. Larbin is (just) a…

