RocketRAG: Performance-First Retrieval-Augmented Generation

Overview

RocketRAG is a speed-focused Retrieval-Augmented Generation framework from TheLion.ai.
It packages document ingestion, semantic chunking, vector storage, and LLM inference into a pluggable toolkit that runs as both a CLI utility and a FastAPI server.
All components are swappable, so the same pipeline can power notebooks, cron jobs, or production APIs.

Architecture Highlights

  • Load – Kreuzberg-based document loaders stream PDFs, Markdown, TXT, and more into a unified format.
  • Chunk – Chonkie semantic chunking (model2vec) preserves context while staying tiny enough for laptop use.
  • Vectorize – Sentence Transformers (or custom encoders) generate embeddings with batched throughput (see the sketch after this list).
  • Store – Milvus Lite keeps the vector DB local, delivering sub-millisecond retrieval without extra dependencies.
  • Generate – llama-cpp-python serves quantized GGUF models for low-latency inference on commodity GPUs or CPUs.
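
For instance, the Vectorize stage boils down to a plain Sentence Transformers call. A minimal standalone sketch of that step (the model name is an illustrative choice, not necessarily RocketRAG's default encoder):

from sentence_transformers import SentenceTransformer

# Illustrative model; swap in whichever encoder the pipeline is configured with.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["RocketRAG keeps retrieval local.", "Milvus Lite stores the vectors."]
# encode() batches internally; batch_size trades memory for throughput.
embeddings = model.encode(chunks, batch_size=64)
print(embeddings.shape)  # (2, 384) for this model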

The plugin architecture defines BaseLoader, BaseChunker, BaseVectorizer, BaseLLM, and BaseVectorDB interfaces, so adding a new chunker or embedding model is just a small class.
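
As a hedged sketch of what such a plugin might look like, here is a hypothetical custom chunker; the import path, the chunk method name, and the constructor are assumptions for illustration, since the exact interface is defined by RocketRAG's BaseChunker:

from rocketrag.base import BaseChunker  # import path assumed for illustration

class SentenceWindowChunker(BaseChunker):
    """Hypothetical plugin: groups sentences into fixed-size windows."""

    def __init__(self, window: int = 3):
        self.window = window

    def chunk(self, text: str) -> list[str]:  # method name inferred from the interface's role
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return [
            ". ".join(sentences[i : i + self.window])
            for i in range(0, len(sentences), self.window)
        ]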

Usage Notes

  • pip install rocketrag (or uvx rocketrag ...) provides the CLI immediately.
  • rocketrag prepare --data-dir ./docs builds the vector index; rocketrag ask "prompt" queries it.
  • rocketrag server --host 0.0.0.0 --port 8000 exposes OpenAI-compatible endpoints, real-time streaming, vector visualizations, and a document browser (see the client sketch after this list).
  • Python users orchestrate the same pipeline via:
from rocketrag import RocketRAG

# Point the pipeline at a directory of documents.
rag = RocketRAG("./data")
# Load, chunk, embed, and index everything under ./data.
rag.prepare()
# Retrieve relevant chunks and generate a grounded answer.
answer, sources = rag.ask("What are the key findings?")
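
Because the endpoints are OpenAI-compatible, any stock OpenAI client can talk to the server. A minimal sketch, assuming the server is running locally on port 8000; the /v1 base path and the model id are assumptions, not documented defaults:

from openai import OpenAI

# Point the standard OpenAI client at the local RocketRAG server.
# base_url and model below are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local",  # placeholder; use the model id the server actually reports
    messages=[{"role": "user", "content": "What are the key findings?"}],
)
print(response.choices[0].message.content)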

Why it Matters

  • Optimized for speed end to end (Kreuzberg loaders, Chonkie chunking, SIMD-friendly embeddings, Milvus Lite retrieval).
  • End-user friendly, pairing a verbose CLI with a full FastAPI server, health checks, visualizers, and a chat UI.
  • Batteries-included configuration, yet fully extensible for custom loaders, chunkers, vectorizers, or LLM runtimes.
  • Designed for reproducible deployments: consistent chunking, metadata preservation, and testing and linting via pytest and ruff.