🐝 HiveMind Crawler

A distributed web crawler system with intelligent Hive controller for scalable web scraping and data collection.

🚀 Features

  • 🐝 Distributed Architecture: Multi-worker crawler system with centralized Hive controller
  • ⚡ High Performance: Built with Fastify for blazing-fast API responses
  • 🔄 Queue Management: Redis-backed job queues with BullMQ for reliable task processing
  • 🌐 Web Scraping: Cheerio-powered HTML parsing and data extraction
  • 📊 Real-time Monitoring: Comprehensive logging and system health tracking
  • 🔧 Modular Design: Clean separation of concerns with configurable components
  • 🛡️ Error Handling: Robust error recovery and graceful shutdown procedures

📋 Prerequisites

  • Node.js 20.0.0 or higher
  • Redis server running locally or remotely
  • Modern browser for web interface (optional)

🛠️ Installation

# Clone the repository
git clone https://github.com/NeaByteLab/HiveMind-Crawler.git
cd HiveMind-Crawler

# Install dependencies
npm install

# Configure Redis connection (see Configuration section)

⚙️ Configuration

Environment Variables

Set these environment variables for configuration:

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0

# Crawler Configuration
CRAWLER_TIMEOUT=10000
CRAWLER_MAX_RETRIES=3
CRAWLER_USER_AGENT=HiveMind-Crawler/1.0
CRAWLER_MAX_CONCURRENT=2
CRAWLER_MAX_DEPTH=3
CRAWLER_HEARTBEAT_INTERVAL=10000
CRAWLER_DEAD_TIMEOUT=30000
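
The variables above could be read into a single config object at startup. The following is a minimal sketch, not the project's actual loader: `loadConfig` and `readInt` are hypothetical names, the fallbacks simply mirror the values listed above, and the timing values are presumably milliseconds.

```javascript
// Hypothetical helper: parse an integer environment variable, or use the fallback.
function readInt(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return value;
}

// Hypothetical loader: builds a config object from the variables documented above.
function loadConfig(env = process.env) {
  return {
    redis: {
      host: env.REDIS_HOST || 'localhost',
      port: readInt(env, 'REDIS_PORT', 6379),
      password: env.REDIS_PASSWORD || undefined, // empty string means "no password"
      db: readInt(env, 'REDIS_DB', 0),
    },
    crawler: {
      timeout: readInt(env, 'CRAWLER_TIMEOUT', 10000),
      maxRetries: readInt(env, 'CRAWLER_MAX_RETRIES', 3),
      userAgent: env.CRAWLER_USER_AGENT || 'HiveMind-Crawler/1.0',
      maxConcurrent: readInt(env, 'CRAWLER_MAX_CONCURRENT', 2),
      maxDepth: readInt(env, 'CRAWLER_MAX_DEPTH', 3),
      heartbeatInterval: readInt(env, 'CRAWLER_HEARTBEAT_INTERVAL', 10000),
      deadTimeout: readInt(env, 'CRAWLER_DEAD_TIMEOUT', 30000),
    },
  };
}
```

Failing fast on a non-numeric value keeps a typo such as `CRAWLER_TIMEOUT=10s` from silently becoming `NaN` later on.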

🚀 Usage

Development Mode

npm run dev

Production Mode

npm start

Testing

# Run tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

Code Quality

# Lint code
npm run lint

# Fix linting issues
npm run lint:fix

# Check linting status
npm run lint:check

πŸ—οΈ Architecture

🐝 Hive Controller

The central intelligence that coordinates all crawler workers:

  • Task Distribution: Assigns crawling jobs to available workers
  • Health Monitoring: Tracks worker status and performance
  • Load Balancing: Distributes load across multiple workers
  • Fault Tolerance: Handles worker failures and recovery
  • Domain Assignment: Manages domain-specific crawler assignments
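
The README does not spell out how health monitoring works, but the `CRAWLER_HEARTBEAT_INTERVAL` and `CRAWLER_DEAD_TIMEOUT` settings suggest a heartbeat scheme. One plausible reading, sketched below with a hypothetical `partitionWorkers` helper, is that a worker counts as dead once its last heartbeat is older than the dead timeout.

```javascript
// Hypothetical helper: split workers into alive and dead by heartbeat age.
// lastSeen maps a worker id to the timestamp (ms) of its most recent heartbeat.
function partitionWorkers(lastSeen, now, deadTimeoutMs = 30000) {
  const alive = [];
  const dead = [];
  for (const [id, timestamp] of lastSeen) {
    if (now - timestamp > deadTimeoutMs) {
      dead.push(id); // no heartbeat within the timeout window
    } else {
      alive.push(id);
    }
  }
  return { alive, dead };
}

// Usage: with the default 30000 ms timeout, crawler-1 is 40 s stale and crawler-2 is 1 s stale.
const workers = new Map([
  ['crawler-1', 1000],
  ['crawler-2', 40000],
]);
const { alive, dead } = partitionWorkers(workers, 41000);
```

With a 10000 ms heartbeat interval and a 30000 ms dead timeout, a worker would have to miss roughly three consecutive heartbeats before being treated as failed.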

🕷️ Crawler Worker

Individual worker processes that perform web scraping:

  • URL Processing: Handles various URL formats and redirects
  • Content Extraction: Parses HTML and extracts structured data
  • Rate Limiting: Respects robots.txt and implements polite crawling
  • Data Storage: Queues extracted data for processing
  • Domain Filtering: Processes only assigned domains
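
Domain filtering combined with the `CRAWLER_MAX_DEPTH` limit might look roughly like the sketch below. `shouldCrawl` is a hypothetical function, and treating subdomains of an assigned domain as a match is an assumption, not documented behavior.

```javascript
// Hypothetical check: should this worker fetch rawUrl at the given link depth?
function shouldCrawl(rawUrl, assignedDomains, depth, maxDepth = 3) {
  if (depth > maxDepth) return false;

  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;

  const host = url.hostname.toLowerCase();
  // Assumption: an assigned domain also covers its subdomains.
  return assignedDomains.some(
    (domain) => host === domain || host.endsWith('.' + domain)
  );
}
```

Matching on `'.' + domain` rather than a bare suffix keeps `notexample.com` from passing as `example.com`.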

🌐 API Server

RESTful API for system management and monitoring:

  • Job Management: Submit, monitor, and control crawling jobs
  • System Status: Real-time health and performance metrics
  • Configuration: Runtime configuration updates
  • Data Access: Retrieve crawling results and statistics

📁 Project Structure

HiveMind-Crawler/
├── src/
│   ├── hive/
│   │   └── controller.js      # 🐝 Hive controller logic
│   ├── crawler/
│   │   └── worker.js          # 🕷️ Crawler worker implementation
│   ├── api/
│   │   └── server.js          # 🌐 Fastify API server
│   ├── config/
│   │   ├── redis.js           # 🔧 Redis configuration
│   │   ├── crawler.js         # 🕷️ Crawler settings
│   │   └── queues.js          # 📋 Queue configuration
│   ├── utils/
│   │   ├── logger.js          # 📝 Logging utilities
│   │   └── url.js             # 🔗 URL processing utilities
│   └── index.js               # 🚀 Main application entry point
├── __tests__/                 # 🧪 Test files
├── package.json               # 📦 Project dependencies
├── eslint.config.js           # 🔍 ESLint configuration
├── jest.config.js             # 🧪 Jest configuration
├── jest.setup.js              # 🧪 Jest setup
└── README.md                  # 📖 This file

🔧 Dependencies

Core Dependencies

  • Fastify: High-performance web framework
  • BullMQ: Redis-based job queue system
  • Cheerio: Server-side jQuery for HTML parsing
  • Axios: HTTP client for web requests
  • Redis: In-memory data structure store
  • UUID: Unique identifier generation

Development Dependencies

  • ESLint: Code linting and formatting
  • ESLint Plugin Unused Imports: Remove unused imports automatically
  • Jest: Testing framework

📊 API Endpoints

🕷️ Crawling Operations

  • POST /crawl - Add URLs to crawl queue
    • Body: { urls: string|string[], priority?: 'high'|'normal', domain?: string }
    • Response: { message: string }
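
Because `urls` accepts either a string or an array, a handler would likely normalize the body before queueing. The sketch below shows one way to do that; `normalizeCrawlBody` is a hypothetical name, and defaulting `priority` to `'normal'` is an assumption rather than documented behavior.

```javascript
// Hypothetical validator for the POST /crawl body described above.
function normalizeCrawlBody(body) {
  if (body === null || typeof body !== 'object') {
    throw new TypeError('body must be an object');
  }

  // Accept a single string or an array of strings.
  const urls = Array.isArray(body.urls) ? body.urls : [body.urls];
  if (urls.length === 0 || !urls.every((u) => typeof u === 'string' && u.length > 0)) {
    throw new TypeError('urls must be a non-empty string or array of strings');
  }

  // Assumption: a missing priority is treated as 'normal'.
  const priority = body.priority ?? 'normal';
  if (priority !== 'high' && priority !== 'normal') {
    throw new TypeError("priority must be 'high' or 'normal'");
  }

  const result = { urls, priority };
  if (body.domain !== undefined) {
    if (typeof body.domain !== 'string') {
      throw new TypeError('domain must be a string');
    }
    result.domain = body.domain;
  }
  return result;
}
```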

🔍 System Monitoring

  • GET /health - System health check
    • Response: { status: 'healthy', timestamp: number }
  • GET /metrics - Comprehensive system metrics
    • Response: { activeCrawlers, queueSize, totalProcessed, totalFailed, avgResponseTime, crawlers[] }
  • GET /crawlers - List all registered crawlers
    • Response: Array<{ id, status, capabilities, stats, assignedDomains, last_update }>

📊 Results & Data

  • GET /results?url=<url> - Get crawl result for specific URL
    • Query: url - The URL to retrieve results for
    • Response: Object|null - Crawl result or null if not found

🎯 Domain Management

  • POST /assign-domain - Assign domain to specific crawler
    • Body: { crawlerId: string, domain: string }
    • Response: { message: string }

🗑️ Queue Management

  • DELETE /queue - Clear all crawl queues
    • Response: { message: 'Queue cleared' }

💡 Usage Examples

Add URLs to Crawl

# Single URL
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com", "priority": "high"}'

# Multiple URLs
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://test.com"], "priority": "normal"}'
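
The same request can be sent from Node.js with the built-in `fetch`. This is a sketch under the assumptions of the curl examples above (server on `http://localhost:3000`); `buildCrawlRequest` and `submitCrawl` are hypothetical helpers, not part of the project.

```javascript
// Hypothetical helper: build the URL and fetch options for POST /crawl.
function buildCrawlRequest(urls, priority = 'normal', baseUrl = 'http://localhost:3000') {
  return {
    url: new URL('/crawl', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ urls, priority }),
    },
  };
}

// Hypothetical helper: send the request and return the parsed JSON response.
async function submitCrawl(urls, priority) {
  const { url, options } = buildCrawlRequest(urls, priority);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`POST /crawl failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (requires a running server):
// submitCrawl(['https://example.com', 'https://test.com'], 'normal').then(console.log);
```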

Check System Health

curl http://localhost:3000/health

Get System Metrics

curl http://localhost:3000/metrics

Get Crawl Results

curl "http://localhost:3000/results?url=https://example.com"
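
The curl call above works for a simple URL, but a target URL that carries its own query string (for example `?q=a&page=2`) needs to be percent-encoded, otherwise its `&` would be read as a separator in the `/results` query. A minimal sketch, using a hypothetical `buildResultsUrl` helper and the same assumed base URL:

```javascript
// Hypothetical helper: build a /results URL with the target safely encoded.
function buildResultsUrl(targetUrl, baseUrl = 'http://localhost:3000') {
  const endpoint = new URL('/results', baseUrl);
  // searchParams.set percent-encodes reserved characters such as & and ?.
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.toString();
}

// Usage: the encoded value round-trips back to the original target URL.
const resultsUrl = buildResultsUrl('https://example.com/search?q=a&page=2');
const decoded = new URL(resultsUrl).searchParams.get('url');
```

With curl, `--get --data-urlencode "url=<target>"` achieves the same encoding.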

Assign Domain to Crawler

curl -X POST http://localhost:3000/assign-domain \
  -H "Content-Type: application/json" \
  -d '{"crawlerId": "crawler-1", "domain": "example.com"}'

πŸ› Troubleshooting

Common Issues

Redis Connection Failed

# Ensure Redis is running
redis-server

# Check connection
redis-cli ping

Worker Not Starting

# Check logs
npm run dev

# Verify Redis connection
# Check configuration files

High Memory Usage

  • Reduce CRAWLER_MAX_CONCURRENT in environment variables
  • Implement data streaming for large datasets
  • Monitor Redis memory usage

🧪 Testing

The project includes comprehensive tests:

# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
