A distributed web crawler system with intelligent Hive controller for scalable web scraping and data collection.
- π Distributed Architecture: Multi-worker crawler system with centralized Hive controller
- β‘ High Performance: Built with Fastify for blazing-fast API responses
- π Queue Management: Redis-backed job queues with BullMQ for reliable task processing
- π Web Scraping: Cheerio-powered HTML parsing and data extraction
- π Real-time Monitoring: Comprehensive logging and system health tracking
- π§ Modular Design: Clean separation of concerns with configurable components
- π‘οΈ Error Handling: Robust error recovery and graceful shutdown procedures
- Node.js 20.0.0 or higher
- Redis server running locally or remotely
- Modern browser for web interface (optional)
# Clone the repository
git clone https://github.com/NeaByteLab/HiveMind-Crawler.git
cd HiveMind-Crawler
# Install dependencies
npm install
# Configure Redis connection (see Configuration section)Set these environment variables for configuration:
# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0
# Crawler Configuration
CRAWLER_TIMEOUT=10000
CRAWLER_MAX_RETRIES=3
CRAWLER_USER_AGENT=HiveMind-Crawler/1.0
CRAWLER_MAX_CONCURRENT=2
CRAWLER_MAX_DEPTH=3
CRAWLER_HEARTBEAT_INTERVAL=10000
CRAWLER_DEAD_TIMEOUT=30000npm run devnpm start# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage# Lint code
npm run lint
# Fix linting issues
npm run lint:fix
# Check linting status
npm run lint:checkThe central intelligence that coordinates all crawler workers:
- Task Distribution: Assigns crawling jobs to available workers
- Health Monitoring: Tracks worker status and performance
- Load Balancing: Distributes load across multiple workers
- Fault Tolerance: Handles worker failures and recovery
- Domain Assignment: Manages domain-specific crawler assignments
Individual worker processes that perform web scraping:
- URL Processing: Handles various URL formats and redirects
- Content Extraction: Parses HTML and extracts structured data
- Rate Limiting: Respects robots.txt and implements polite crawling
- Data Storage: Queues extracted data for processing
- Domain Filtering: Processes only assigned domains
RESTful API for system management and monitoring:
- Job Management: Submit, monitor, and control crawling jobs
- System Status: Real-time health and performance metrics
- Configuration: Runtime configuration updates
- Data Access: Retrieve crawling results and statistics
HiveMind-Crawler/
βββ src/
β βββ hive/
β β βββ controller.js # π Hive controller logic
β βββ crawler/
β β βββ worker.js # π·οΈ Crawler worker implementation
β βββ api/
β β βββ server.js # π Fastify API server
β βββ config/
β β βββ redis.js # π§ Redis configuration
β β βββ crawler.js # π·οΈ Crawler settings
β β βββ queues.js # π Queue configuration
β βββ utils/
β β βββ logger.js # π Logging utilities
β β βββ url.js # π URL processing utilities
β βββ index.js # π Main application entry point
βββ __tests__/ # π§ͺ Test files
βββ package.json # π¦ Project dependencies
βββ eslint.config.js # π ESLint configuration
βββ jest.config.js # π§ͺ Jest configuration
βββ jest.setup.js # π§ͺ Jest setup
βββ README.md # π This file
- Fastify: High-performance web framework
- BullMQ: Redis-based job queue system
- Cheerio: Server-side jQuery for HTML parsing
- Axios: HTTP client for web requests
- Redis: In-memory data structure store
- UUID: Unique identifier generation
- ESLint: Code linting and formatting
- ESLint Plugin Unused Imports: Remove unused imports automatically
- Jest: Testing framework
POST /crawl- Add URLs to crawl queue- Body:
{ urls: string|string[], priority?: 'high'|'normal', domain?: string } - Response:
{ message: string }
- Body:
GET /health- System health check- Response:
{ status: 'healthy', timestamp: number }
- Response:
GET /metrics- Comprehensive system metrics- Response:
{ activeCrawlers, queueSize, totalProcessed, totalFailed, avgResponseTime, crawlers[] }
- Response:
GET /crawlers- List all registered crawlers- Response:
Array<{ id, status, capabilities, stats, assignedDomains, last_update }>
- Response:
GET /results?url=<url>- Get crawl result for specific URL- Query:
url- The URL to retrieve results for - Response:
Object|null- Crawl result or null if not found
- Query:
POST /assign-domain- Assign domain to specific crawler- Body:
{ crawlerId: string, domain: string } - Response:
{ message: string }
- Body:
DELETE /queue- Clear all crawl queues- Response:
{ message: 'Queue cleared' }
- Response:
# Single URL
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"urls": "https://example.com", "priority": "high"}'
# Multiple URLs
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com", "https://test.com"], "priority": "normal"}'curl http://localhost:3000/healthcurl http://localhost:3000/metricscurl "http://localhost:3000/results?url=https://example.com"curl -X POST http://localhost:3000/assign-domain \
-H "Content-Type: application/json" \
-d '{"crawlerId": "crawler-1", "domain": "example.com"}'Redis Connection Failed
# Ensure Redis is running
redis-server
# Check connection
redis-cli pingWorker Not Starting
# Check logs
npm run dev
# Verify Redis connection
# Check configuration filesHigh Memory Usage
- Reduce
CRAWLER_MAX_CONCURRENTin environment variables - Implement data streaming for large datasets
- Monitor Redis memory usage
The project includes comprehensive tests:
# Run all tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.