A from-scratch implementation of a document search engine using TF-IDF vectorisation and cosine similarity. Built entirely in Python, with no pre-built search libraries.
This search engine indexes classic literature and supports ranked keyword search across multiple documents. The system preprocesses text, builds TF-IDF representations, and ranks documents by cosine similarity to the query.
Key Features:
- Custom TF-IDF vectoriser implementation from scratch
- Configurable text preprocessing (stemming, stopword removal)
- Cosine similarity ranking for document retrieval
- Evaluation metrics (Precision@K, Recall@K, Average Precision)
- Interactive search interface
- Python 3.9+
- NumPy - Numerical computations and vector operations
- Pandas - Data manipulation
- NLTK - Tokenisation and linguistic preprocessing
- Scikit-learn - Validation and comparison benchmarks
- Matplotlib - Evaluation visualizations
python-text-search-engine/
├── src/
│ ├── loader.py # Document loading and management
│ ├── preprocessing.py # Text cleaning and tokenisation
│ ├── vectorizer.py # TF-IDF implementation
│ ├── search.py # Search engine with cosine similarity
│ └── evaluation.py # Performance metrics
├── data/
│ └── raw_texts/ # Classic literature corpus
├── demo_search.py # Interactive search interface
├── test_*.py # Unit tests for each module
└── requirements.txt
- Python 3.9 or higher
- Git
- Windows/Linux/macOS compatible
- Clone the repository:
git clone https://github.com/ConstantlyTrying989/python-text-search-engine.git
cd python-text-search-engine
- Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate.bat
- Install dependencies:
pip install -r requirements.txt
- Download NLTK data (for the first run):
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
Run the interactive search demo:
python demo_search.py
Example queries to try:
- detective mystery crime
- ocean whale adventure
- love romance marriage
The system loads plain text documents, indexing a collection of classic novels.
- Converts text to lowercase
- Removes URLs, emails, and numbers
- Tokenizes using NLTK's word_tokenize
- Removes stopwords (common words like "and")
- Applies Porter stemming to reduce words to root forms
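The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in preprocessing.py: a regular expression stands in for NLTK's word_tokenize, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and stemming is left as an optional callable (the `stem` parameter is hypothetical) where a Porter stemmer could be plugged in.

```python
import re

# Tiny illustrative stopword sample; the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "or"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+")
TOKEN_RE = re.compile(r"[a-z]+")  # stand-in for NLTK's word_tokenize


def preprocess(text, stem=None, remove_stopwords=True):
    """Lowercase, strip URLs/emails/numbers, tokenise, filter, optionally stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem is not None:
        # `stem` is any callable mapping a token to its root form.
        tokens = [stem(t) for t in tokens]
    return tokens
```

Keeping stopword removal and stemming behind parameters mirrors the "configurable preprocessing" idea: the same function can produce raw tokens for inspection or fully normalised tokens for indexing.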
Implements Term Frequency-Inverse Document Frequency from scratch:
- TF (Term Frequency): Normalized word frequency in document
- IDF (Inverse Document Frequency): log(N / document_frequency)
- Creates sparse vector representation for each document
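As a rough sketch of the weighting described above, the following computes TF-IDF with plain dictionaries as sparse vectors. It assumes "normalized word frequency" means the term count divided by the document length, and it uses the unsmoothed IDF log(N / document_frequency) exactly as written; the function name `build_tfidf` is hypothetical and the project's vectorizer.py may differ in both respects.

```python
import math
from collections import Counter


def build_tfidf(tokenized_docs):
    """Return (vectors, idf) for a list of token lists.

    Each vector is a dict {term: weight}, a simple sparse representation.
    """
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    # Unsmoothed IDF: a term present in every document gets weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        # Assumed TF normalisation: count divided by document length.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors, idf
```

One consequence of the unsmoothed formula is worth knowing: a term that occurs in every document has IDF log(1) = 0 and so contributes nothing to any score.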
Compares query vector to document vectors using cosine similarity:
similarity = (A · B) / (||A|| × ||B||)
Returns top K most similar documents ranked by score.
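The ranking step can be illustrated with a short sketch. It assumes vectors are stored as {term: weight} dictionaries and that documents are keyed by an identifier; the names `cosine_similarity` and `rank` are illustrative and not necessarily those used in search.py, which may operate on NumPy arrays instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Iterate over the shorter vector when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, k=5):
    """Return the top-k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Returning 0.0 for a zero-norm vector avoids a division by zero when a query contains only terms that are absent from the vocabulary.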
- Precision@K: Accuracy of top K results
- Recall@K: Coverage of relevant documents in top K
- Average Precision: Overall ranking quality
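For reference, the three metrics can be written in a few lines. This sketch uses the common textbook definitions (Precision@K divides by K, Average Precision divides by the total number of relevant documents); the function names are illustrative and evaluation.py may make different choices.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)
```

For a ranking ["a", "b", "c", "d"] with relevant set {"a", "c", "e"}, Precision@2 is 1/2, Recall@2 is 1/3, and Average Precision is (1/1 + 2/3) / 3, since "e" is never retrieved.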
Run all tests:
python test_loader.py
python test_preprocessing.py
python test_vectorizer.py
python test_search.py
python test_evaluation.py
Expected output:
- Document loading verification
- Preprocessing statistics
- TF-IDF matrix dimensions
- Search result rankings
- Evaluation metric scores
The search engine efficiently processes queries across the indexed corpus:
- Optimized TF-IDF computation
- Quick document retrieval and ranking
- Sub-second average query time
- Scalable vector operations using NumPy
Sample Results: Query-dependent results ranked by cosine similarity scores.
Custom Implementation:
- No use of sklearn's TfidfVectorizer
- Manual cosine similarity computation
- Custom evaluation metrics for testing
Algorithmic Complexity:
- Indexing: O(N × M) where N = documents, M = avg tokens
- Query: O(V) per document comparison, O(N × V) to score all N documents, where V = vocabulary size
- Memory: O(N × V) for document-term matrix
This project demonstrates:
- Information retrieval fundamentals (TF-IDF, cosine similarity)
- Text preprocessing and NLP pipeline design
- NumPy for efficient vector operations
- Software engineering best practices (testing)
- Working with real-world text data
- Implement search for synonyms in the texts
- Add phrase search support
- Add relevance feedback mechanism
This is a portfolio project, but suggestions are welcome! Feel free to open an issue or submit a pull request.
This project is open source and available under the MIT License.
Jason Lewis
- GitHub: @ConstantlyTrying989
- LinkedIn: Jason Lewis
- Email: lewisjd2007@gmail.com
- Project Gutenberg for public domain texts
- NLTK team for NLP tools
- Classic literature authors for the corpus