# RocketRAG: Performance-First Retrieval-Augmented Generation
## Overview
RocketRAG is a speed-focused Retrieval-Augmented Generation framework from TheLion.ai.
It packages document ingestion, semantic chunking, vector storage, and LLM inference into a pluggable toolkit that runs as both a CLI utility and a FastAPI server.
All components are swappable, so the same pipeline can power notebooks, cron jobs, or production APIs.
## Architecture Highlights
- Load – Kreuzberg-based document loaders stream PDFs, Markdown, TXT, and more into a unified format.
- Chunk – Chonkie semantic chunking (model2vec) preserves context while staying tiny enough for laptop use.
- Vectorize – Sentence Transformers (or custom encoders) generate embeddings with batched throughput.
- Store – Milvus Lite keeps the vector DB local, delivering sub-millisecond retrieval without extra dependencies.
- Generate – llama-cpp-python serves quantized GGUF models for low-latency inference on commodity GPUs or CPUs.
The plugin architecture defines BaseLoader, BaseChunker, BaseVectorizer, BaseLLM, and BaseVectorDB interfaces, so adding a new chunker or embedding model only requires implementing a small subclass.
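To illustrate the plugin idea, here is a minimal sketch of what a custom chunker might look like. Note the assumptions: the `BaseChunker` stand-in below (and its `chunk()` method signature) is hypothetical, written to show the pattern; the real interface would be imported from the rocketrag package and may differ.

```python
from abc import ABC, abstractmethod


# Hypothetical stand-in for RocketRAG's BaseChunker interface.
# In a real plugin you would import the actual base class from rocketrag.
class BaseChunker(ABC):
    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split a document into retrieval-sized chunks."""


class WordWindowChunker(BaseChunker):
    """Naive fixed-size word-window chunker with overlap between windows."""

    def __init__(self, window: int = 128, overlap: int = 32):
        self.window = window
        self.overlap = overlap

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        step = self.window - self.overlap
        # Slide a window of `window` words, advancing by `step` each time,
        # so consecutive chunks share `overlap` words of context.
        return [
            " ".join(words[i : i + self.window])
            for i in range(0, max(len(words) - self.overlap, 1), step)
        ]
```

Swapping in a semantic chunker (as RocketRAG does with Chonkie) follows the same shape: implement the interface, then pass your instance to the pipeline.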
## Usage Notes
- `pip install rocketrag` (or `uvx rocketrag ...`) provides the CLI immediately.
- `rocketrag prepare --data-dir ./docs` builds the vector index; `rocketrag ask "prompt"` queries it.
- `rocketrag server --host 0.0.0.0 --port 8000` exposes OpenAI-compatible endpoints, realtime streaming, vector visualizations, and a document browser.
- Python users orchestrate the same pipeline via:
```python
from rocketrag import RocketRAG

rag = RocketRAG("./data")
rag.prepare()
answer, sources = rag.ask("What are the key findings?")
```
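Because the server advertises OpenAI-compatible endpoints, any OpenAI-style client should work against it. The sketch below uses only the standard library; the `/v1/chat/completions` route and the `"local"` model name are assumptions based on the OpenAI convention, so check RocketRAG's server docs for the exact paths.

```python
import json
import urllib.request

# Assumed base URL for a server started with:
#   rocketrag server --host 0.0.0.0 --port 8000
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt: str, model: str = "local") -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def ask_server(prompt: str) -> str:
    """POST the payload to a running RocketRAG server and return the answer.

    Requires `rocketrag server` to be up; the endpoint path is assumed.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses nest the text under choices[0].message.
    return body["choices"][0]["message"]["content"]
```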
## Why it Matters
- Optimized for speed end to end (Kreuzberg loaders, Chonkie chunking, SIMD-friendly embeddings, Milvus Lite retrieval).
- End-user friendly with verbose CLI UX plus full FastAPI server, health checks, visualizers, and chat UI.
- Batteries-included configuration, yet fully extensible for custom loaders, chunkers, vectorizers, or LLM runtimes.
- Designed for reproducible deployments: consistent chunking, metadata preservation, and testing via `pytest` + `ruff`.