RAG Search Engine

A hybrid search engine for movie data implementing BM25 keyword search, semantic embedding search, and CLIP multimodal search, with results fused via Reciprocal Rank Fusion (RRF) or a weighted score combination. Layered on top is a full AI pipeline: Gemini-powered query enhancement, three reranking strategies, and a Retrieval-Augmented Generation (RAG) module for question answering and summarisation.

Repository: https://github.com/armoutihansen/rag-search-engine

Goal

Build a production-grade RAG search pipeline from scratch — implementing each component (inverted index, embedding search, query rewriting, reranking, generation) manually to understand the trade-offs between approaches rather than relying on a single end-to-end framework.

Scope (In / Out)

In:

  • BM25 inverted index with Porter stemming and stopword removal
  • Sentence-transformer semantic search (all-MiniLM-L6-v2) with chunked documents
  • CLIP (ViT-B-32) multimodal image-based search
  • Hybrid fusion: weighted α-combination and Reciprocal Rank Fusion (RRF)
  • AI query enhancement: spell correction, rewrite, expansion, and image-to-query (Gemini 2.5 Flash)
  • Three reranking strategies: individual scoring, batch ranking, cross-encoder
  • RAG module: Q&A, summarisation, and citation-based answer generation
  • Evaluation framework: Precision@k, Recall@k, F1 against golden dataset
  • Persistent cache (index + embeddings to disk)
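Documents are chunked before embedding so that long descriptions fit the sentence-transformer's input and retrieval stays fine-grained. A minimal sketch of an overlapping word-window chunker — the 100-word window and 20-word overlap here are illustrative defaults, not the project's actual parameters:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows for embedding.

    Consecutive chunks share `overlap` words so that sentences straddling
    a chunk boundary still appear intact in at least one chunk.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is embedded separately, and a document's score is typically taken from its best-matching chunk.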

Out:

  • Live crawling or document ingestion pipeline
  • Multi-turn conversational search
  • Fine-tuned retrieval models
  • Production deployment (no API server)
  • Datasets beyond movies

Deliverables

| Artefact | Description |
| --- | --- |
| cli/lib/keyword_search.py | BM25 inverted index with stemming, stopwords, score normalisation |
| cli/lib/semantic_search.py | Chunked sentence-transformer search with embedding caching |
| cli/lib/multimodal_search.py | CLIP image embedding search |
| cli/lib/hybrid_search.py | RRF and weighted fusion; Gemini query enhancement and reranking |
| cli/lib/augmented_generation.py | RAG — Q&A, summaries, citations from retrieved context |
| cli/*_cli.py | CLI entry points for each search mode |
| cli/evaluation_cli.py | Evaluation harness — Precision@k, Recall@k, F1 |
| data/golden_dataset.json | Hand-curated test cases for evaluation |
| tests/ | pytest suite covering all core modules |

Data

| Dataset | Description |
| --- | --- |
| data/movies.json | Movie corpus: titles, descriptions, genres, poster URLs |
| data/golden_dataset.json | Curated query–relevant-document pairs for evaluation |

The movie dataset serves as a controlled, semantically rich domain for testing retrieval strategies. Poster URLs enable multimodal image-based search.

Architecture

Retrieval Layer

| Component | Method | Model / Params |
| --- | --- | --- |
| Keyword | BM25 | k1=1.5, b=0.75, Porter stemmer |
| Semantic | Dense retrieval | all-MiniLM-L6-v2 (384-dim), chunked docs |
| Multimodal | Image embedding | CLIP ViT-B-32 |
| Fusion | RRF | k=60 |
| Fusion | Weighted | α-combination of normalised scores |
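The two fusion strategies above can be sketched in a few lines. The function names and the doc-ID/score-dict shapes here are illustrative assumptions, not the project's actual API; the RRF formula (score summing 1/(k + rank) across ranked lists, k=60) and the α-weighted combination follow the parameters in the table:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_fuse(bm25: dict[str, float], semantic: dict[str, float],
                  alpha: float = 0.5) -> list[str]:
    """Combine normalised score dicts: alpha * keyword + (1 - alpha) * semantic."""
    docs = set(bm25) | set(semantic)
    combined = {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * semantic.get(d, 0.0)
                for d in docs}
    return sorted(combined, key=combined.get, reverse=True)
```

RRF only needs rank positions, so it sidesteps the score-normalisation problem that the weighted variant must handle (BM25 and cosine scores live on different scales).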

AI Enhancement Pipeline

| Stage | Method | Notes |
| --- | --- | --- |
| Query enhancement | Gemini 2.5 Flash | spell correction, rewrite, expansion |
| Image-to-query | Gemini 2.5 Flash + CLIP | describe image, embed description |
| Reranking (individual) | Gemini 2.5 Flash | 1 API call per document, 0–10 score |
| Reranking (batch) | Gemini 2.5 Flash | 1 API call per page, ranked list output |
| Reranking (cross-encoder) | cross-encoder/ms-marco-MiniLM-L6-v2 | local inference, no API cost |
| RAG generation | Gemini 2.5 Flash | Q&A, summaries, citation anchors |
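Batch reranking trades per-document calls for one call per page that returns a ranked list, which then has to be parsed defensively. A hedged sketch, assuming the model replies with comma-separated 1-based indices (the project's actual prompt and response format may differ; `apply_batch_ranking` is an illustrative name):

```python
def apply_batch_ranking(docs: list[str], model_output: str) -> list[str]:
    """Reorder docs by a model's ranked list of 1-based indices, e.g. "3, 1, 2".

    Out-of-range or duplicate indices are ignored, and indices the model
    omits are appended in original order, so a malformed response degrades
    gracefully instead of dropping results.
    """
    seen: set[int] = set()
    order: list[int] = []
    for token in model_output.replace(",", " ").split():
        if token.isdigit():
            i = int(token) - 1
            if 0 <= i < len(docs) and i not in seen:
                seen.add(i)
                order.append(i)
    order += [i for i in range(len(docs)) if i not in seen]
    return [docs[i] for i in order]
```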

Evaluation

Metrics computed against golden_dataset.json:

  • Precision@k: fraction of top-k results that are relevant
  • Recall@k: fraction of relevant documents found in top-k
  • F1@k: harmonic mean of precision and recall at k
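The three definitions above reduce to a few lines; a sketch consistent with those definitions (the function name and tuple return shape are illustrative, not the evaluation harness's actual interface):

```python
def precision_recall_f1_at_k(retrieved: list[str], relevant: set[str],
                             k: int) -> tuple[float, float, float]:
    """Precision@k, Recall@k, and F1@k against a golden relevant set."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    precision = hits / k                                   # relevant fraction of top-k
    recall = hits / len(relevant) if relevant else 0.0     # relevant docs found in top-k
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                  # harmonic mean at k
    return precision, recall, f1
```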

Engineering

| Concern | Choice |
| --- | --- |
| Language | Python 3.13 |
| LLM | Gemini 2.5 Flash (google-genai>=1.65.0) |
| Semantic model | sentence-transformers>=5.2.3 |
| Multimodal model | CLIP via sentence-transformers |
| Text processing | nltk==3.9.1 (Porter stemmer, stopwords) |
| Numerics | numpy>=2.4.2 |
| Env management | uv + .env via python-dotenv |
| Caching | cache/*.pkl — persists BM25 index + embeddings |
| Testing | pytest>=8.0.0 |
| Expansion factor | 500× initial retrieval before reranking |

Timeline

Completed (learning / portfolio project).

References

  • Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
  • Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML.