🐝 HiveMind Crawler

A distributed web crawler system with an intelligent Hive controller for scalable web scraping and data collection.

🚀 Features

🐝 Distributed Architecture: Multi-worker crawler system with centralized Hive controller
⚡ High Performance: Built with Fastify for blazing-fast API responses
🔄 Queue Management: Redis-backed job queues with BullMQ for reliable task processing
🌐 Web Scraping: Cheerio-powered HTML parsing and data extraction
📊 Real-time Monitoring: Comprehensive logging and system health tracking
🔧 Modular Design: Clean separation of concerns with configurable components
🛡️ Error Handling: Robust error recovery and graceful shutdown procedures

📋 Prerequisites

Node.js 20.0.0 or higher
Redis server running locally or remotely
Modern browser for web interface (optional)

🛠️ Installation

# Clone the repository
git clone https://github.com/NeaByteLab/HiveMind-Crawler.git
cd HiveMind-Crawler

# Install dependencies
npm install

# Configure Redis connection (see Configuration section)

⚙️ Configuration

Environment Variables

Set the following environment variables to configure the Redis connection and crawler behavior:

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0

# Crawler Configuration
CRAWLER_TIMEOUT=10000
CRAWLER_MAX_RETRIES=3
CRAWLER_USER_AGENT=HiveMind-Crawler/1.0
CRAWLER_MAX_CONCURRENT=2
CRAWLER_MAX_DEPTH=3
CRAWLER_HEARTBEAT_INTERVAL=10000
CRAWLER_DEAD_TIMEOUT=30000
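
As a rough illustration of how these variables could be consumed, the sketch below reads them from `process.env` and falls back to the sample values listed above. The `loadConfig` and `readInt` helpers and the shape of the returned object are assumptions made for this example; they are not taken from the files in `src/config/`.

```javascript
// Hypothetical helper: parse an integer variable, using the fallback when the
// variable is unset, empty, or not a number.
function readInt(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Hypothetical loader: the fallbacks mirror the sample values in this README.
function loadConfig(env = process.env) {
  return {
    redis: {
      host: env.REDIS_HOST || 'localhost',
      port: readInt(env, 'REDIS_PORT', 6379),
      // An empty REDIS_PASSWORD is treated as "no password".
      password: env.REDIS_PASSWORD || undefined,
      db: readInt(env, 'REDIS_DB', 0),
    },
    crawler: {
      timeout: readInt(env, 'CRAWLER_TIMEOUT', 10000),
      maxRetries: readInt(env, 'CRAWLER_MAX_RETRIES', 3),
      userAgent: env.CRAWLER_USER_AGENT || 'HiveMind-Crawler/1.0',
      maxConcurrent: readInt(env, 'CRAWLER_MAX_CONCURRENT', 2),
      maxDepth: readInt(env, 'CRAWLER_MAX_DEPTH', 3),
      heartbeatInterval: readInt(env, 'CRAWLER_HEARTBEAT_INTERVAL', 10000),
      deadTimeout: readInt(env, 'CRAWLER_DEAD_TIMEOUT', 30000),
    },
  };
}
```

Parsing in one place keeps numeric settings from leaking into the rest of the code as strings, which is a common source of subtle comparison bugs.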

🚀 Usage

Development Mode

npm run dev

Production Mode

npm start

Testing

# Run tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

Code Quality

# Lint code
npm run lint

# Fix linting issues
npm run lint:fix

# Check linting status
npm run lint:check

🏗️ Architecture

🐝 Hive Controller

The central intelligence that coordinates all crawler workers:

Task Distribution: Assigns crawling jobs to available workers
Health Monitoring: Tracks worker status and performance
Load Balancing: Distributes load across multiple workers
Fault Tolerance: Handles worker failures and recovery
Domain Assignment: Manages domain-specific crawler assignments
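
The responsibilities above can be sketched as a single selection step. This is only an illustration under stated assumptions: the worker record shape (`id`, `lastHeartbeat`, `activeJobs`, `assignedDomains`) and the `isAlive` and `pickWorker` functions are hypothetical and do not come from `src/hive/controller.js`. A worker is treated as dead once its last heartbeat is older than a timeout such as `CRAWLER_DEAD_TIMEOUT`.

```javascript
// Hypothetical worker record: { id, lastHeartbeat, activeJobs, assignedDomains }.
// Timestamps and the timeout are assumed to share the same unit.
function isAlive(worker, now, deadTimeout) {
  return now - worker.lastHeartbeat <= deadTimeout;
}

// Pick a worker for a job on `domain`:
// 1. skip workers whose heartbeat is too old (fault tolerance),
// 2. prefer workers explicitly assigned to the domain (domain assignment),
// 3. among the candidates, take the one with the fewest active jobs (load balancing).
function pickWorker(workers, domain, now, deadTimeout) {
  const alive = workers.filter((w) => isAlive(w, now, deadTimeout));
  if (alive.length === 0) return null;

  const dedicated = alive.filter((w) => w.assignedDomains.includes(domain));
  const candidates = dedicated.length > 0 ? dedicated : alive;

  return candidates.reduce((best, w) => (w.activeJobs < best.activeJobs ? w : best));
}
```

Returning `null` when no worker is alive lets the caller decide whether to keep the job queued or report an error.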

🕷️ Crawler Worker

Individual worker processes that perform web scraping:

URL Processing: Handles various URL formats and redirects
Content Extraction: Parses HTML and extracts structured data
Rate Limiting: Respects robots.txt and implements polite crawling
Data Storage: Queues extracted data for processing
Domain Filtering: Processes only assigned domains
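
To make URL processing and domain filtering more concrete, here is a minimal sketch built on the standard WHATWG `URL` class that ships with Node.js. The `normalizeUrl` and `shouldCrawl` functions are hypothetical names chosen for this example, and the subdomain matching rule is an assumption; the project's own logic lives in `src/utils/url.js` and `src/crawler/worker.js` and may differ.

```javascript
// Resolve a link found on a page against that page's URL, drop the fragment,
// and reject anything that is not HTTP(S). Returns null for unusable links.
function normalizeUrl(rawUrl, baseUrl) {
  let url;
  try {
    url = new URL(rawUrl, baseUrl);
  } catch {
    return null; // not a valid absolute or relative URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  url.hash = '';
  return url.href;
}

// Decide whether a normalized URL should be crawled: it must be within the
// depth limit and belong to an assigned domain (or one of its subdomains).
function shouldCrawl(normalizedUrl, depth, assignedDomains, maxDepth) {
  if (normalizedUrl === null || depth > maxDepth) return false;
  const { hostname } = new URL(normalizedUrl);
  return assignedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}
```

Matching on `.${domain}` rather than a plain suffix avoids treating `notexample.com` as part of `example.com`.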

🌐 API Server

RESTful API for system management and monitoring:

Job Management: Submit, monitor, and control crawling jobs
System Status: Real-time health and performance metrics
Configuration: Runtime configuration updates
Data Access: Retrieve crawling results and statistics

📁 Project Structure

HiveMind-Crawler/
├── src/
│   ├── hive/
│   │   └── controller.js      # 🐝 Hive controller logic
│   ├── crawler/
│   │   └── worker.js          # 🕷️ Crawler worker implementation
│   ├── api/
│   │   └── server.js          # 🌐 Fastify API server
│   ├── config/
│   │   ├── redis.js           # 🔧 Redis configuration
│   │   ├── crawler.js         # 🕷️ Crawler settings
│   │   └── queues.js          # 📋 Queue configuration
│   ├── utils/
│   │   ├── logger.js          # 📝 Logging utilities
│   │   └── url.js             # 🔗 URL processing utilities
│   └── index.js               # 🚀 Main application entry point
├── __tests__/                 # 🧪 Test files
├── package.json               # 📦 Project dependencies
├── eslint.config.js           # 🔍 ESLint configuration
├── jest.config.js             # 🧪 Jest configuration
├── jest.setup.js              # 🧪 Jest setup
└── README.md                  # 📖 This file

🔧 Dependencies

Core Dependencies

Fastify: High-performance web framework
BullMQ: Redis-based job queue system
Cheerio: jQuery-like API for server-side HTML parsing
Axios: HTTP client for web requests
Redis: In-memory data structure store
UUID: Unique identifier generation

Development Dependencies

ESLint: Code linting and formatting
ESLint Plugin Unused Imports: Remove unused imports automatically
Jest: Testing framework

📊 API Endpoints

🕷️ Crawling Operations

POST /crawl - Add URLs to crawl queue
- Body: { urls: string|string[], priority?: 'high'|'normal', domain?: string }
- Response: { message: string }

🔍 System Monitoring

GET /health - System health check
- Response: { status: 'healthy', timestamp: number }

GET /metrics - Comprehensive system metrics
- Response: { activeCrawlers, queueSize, totalProcessed, totalFailed, avgResponseTime, crawlers[] }

GET /crawlers - List all registered crawlers
- Response: Array<{ id, status, capabilities, stats, assignedDomains, last_update }>

📊 Results & Data

GET /results?url=<url> - Get crawl result for specific URL
- Query: url - The URL to retrieve results for
- Response: Object|null - Crawl result or null if not found

🎯 Domain Management

POST /assign-domain - Assign domain to specific crawler
- Body: { crawlerId: string, domain: string }
- Response: { message: string }

🗑️ Queue Management

DELETE /queue - Clear all crawl queues
- Response: { message: 'Queue cleared' }

💡 Usage Examples

Add URLs to Crawl

# Single URL
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com", "priority": "high"}'

# Multiple URLs
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://test.com"], "priority": "normal"}'
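
The same requests can be sent from Node.js. The sketch below is an illustrative client, not part of the project: `buildCrawlBody` and `submitCrawl` are hypothetical helpers, and the client-side validation is an assumption based on the body shape documented for POST /crawl. It relies on the global `fetch` available in Node.js 18 and later.

```javascript
// Build a request body matching the documented shape:
// { urls: string|string[], priority?: 'high'|'normal', domain?: string }
function buildCrawlBody(urls, options = {}) {
  const list = Array.isArray(urls) ? urls : [urls];
  if (list.length === 0) throw new Error('At least one URL is required');
  for (const candidate of list) {
    new URL(candidate); // throws a TypeError if the URL is not absolute and valid
  }

  const body = { urls };
  if (options.priority !== undefined) {
    if (options.priority !== 'high' && options.priority !== 'normal') {
      throw new Error(`Unsupported priority: ${options.priority}`);
    }
    body.priority = options.priority;
  }
  if (options.domain) body.domain = options.domain;
  return body;
}

// Send the body to POST /crawl and return the parsed JSON response.
async function submitCrawl(baseUrl, urls, options) {
  const response = await fetch(new URL('/crawl', baseUrl), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildCrawlBody(urls, options)),
  });
  if (!response.ok) {
    throw new Error(`POST /crawl failed with status ${response.status}`);
  }
  return response.json();
}
```

For example, `submitCrawl('http://localhost:3000', ['https://example.com'], { priority: 'high' })` would send the same payload as a multi-URL curl call with high priority.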

Check System Health

curl http://localhost:3000/health

Get System Metrics

curl http://localhost:3000/metrics

Get Crawl Results

curl "http://localhost:3000/results?url=https://example.com"
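
When the target URL contains its own query string, it needs to be percent-encoded before it is placed in the `url` parameter, otherwise characters such as `&` would be read as separators of the outer query. A small sketch using the standard `URL` and `URLSearchParams` APIs follows; `resultsUrl` is a hypothetical helper name used only for this example.

```javascript
// Build the GET /results address for a given target URL.
// searchParams.set() percent-encodes the value, so '&', '?' and '=' inside
// the target URL do not leak into the outer query string.
function resultsUrl(baseUrl, targetUrl) {
  const endpoint = new URL('/results', baseUrl);
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.href;
}

const address = resultsUrl(
  'http://localhost:3000',
  'https://example.com/search?q=crawler&page=2'
);
console.log(address);
```

The resulting address can be passed to `curl` or `fetch` as-is.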

Assign Domain to Crawler

curl -X POST http://localhost:3000/assign-domain \
  -H "Content-Type: application/json" \
  -d '{"crawlerId": "crawler-1", "domain": "example.com"}'

🐛 Troubleshooting

Common Issues

Redis Connection Failed

# Ensure Redis is running
redis-server

# Check connection
redis-cli ping

Worker Not Starting

# Check logs
npm run dev

# Verify that Redis is reachable (a healthy server replies PONG)
redis-cli ping

# Check the configuration files in src/config/

High Memory Usage

Reduce CRAWLER_MAX_CONCURRENT in environment variables
Implement data streaming for large datasets
Monitor Redis memory usage

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
