# RocketRAG: Performance-First Retrieval-Augmented Generation
## Overview
RocketRAG is a speed-focused Retrieval-Augmented Generation framework from TheLion.ai.
It packages document ingestion, semantic chunking, vector storage, and LLM inference into a pluggable toolkit that runs as both a CLI utility and a FastAPI server.
All components are swappable, so the same pipeline can power notebooks, cron jobs, or production APIs.
## Architecture Highlights
- Load – Kreuzberg-based document loaders stream PDFs, Markdown, TXT, and more into a unified format.
- Chunk – Chonkie semantic chunking (model2vec) preserves context while staying tiny enough for laptop use.
- Vectorize – Sentence Transformers (or custom encoders) generate embeddings with batched throughput.
- Store – Milvus Lite keeps the vector DB local, delivering sub-millisecond retrieval without extra dependencies.
- Generate – llama-cpp-python serves quantized GGUF models for low-latency inference on commodity GPUs or CPUs.
The plugin architecture defines BaseLoader, BaseChunker, BaseVectorizer, BaseLLM, and BaseVectorDB interfaces, so adding a new chunker or embedding model only requires implementing a small subclass.
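To illustrate the plugin idea, here is a minimal sketch of what a custom chunker might look like. Note the assumptions: the `BaseChunker` stand-in below (and its `chunk()` method signature) is hypothetical, written to show the pattern; the real interface would be imported from the rocketrag package and may differ.

```python
from abc import ABC, abstractmethod


# Hypothetical stand-in for RocketRAG's BaseChunker interface.
# In a real plugin you would import the actual base class from rocketrag.
class BaseChunker(ABC):
    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split a document into retrieval-sized chunks."""


class WordWindowChunker(BaseChunker):
    """Naive fixed-size word-window chunker with overlap between windows."""

    def __init__(self, window: int = 128, overlap: int = 32):
        self.window = window
        self.overlap = overlap

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        step = self.window - self.overlap
        # Slide a window of `window` words, advancing by `step` each time,
        # so consecutive chunks share `overlap` words of context.
        return [
            " ".join(words[i : i + self.window])
            for i in range(0, max(len(words) - self.overlap, 1), step)
        ]
```

Swapping in a semantic chunker (as RocketRAG does with Chonkie) follows the same shape: implement the interface, then pass your instance to the pipeline.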
## Usage Notes
- `pip install rocketrag` (or `uvx rocketrag ...`) provides the CLI immediately.
- `rocketrag prepare --data-dir ./docs` builds the vector index; `rocketrag ask "prompt"` queries it.
- `rocketrag server --host 0.0.0.0 --port 8000` exposes OpenAI-compatible endpoints, realtime streaming, vector visualizations, and a document browser.
- Python users orchestrate the same pipeline via:
```python
from rocketrag import RocketRAG

rag = RocketRAG("./data")
rag.prepare()
answer, sources = rag.ask("What are the key findings?")
```
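Because the server advertises OpenAI-compatible endpoints, any OpenAI-style client should work against it. The sketch below uses only the standard library; the `/v1/chat/completions` route and the `"local"` model name are assumptions based on the OpenAI convention, so check RocketRAG's server docs for the exact paths.

```python
import json
import urllib.request

# Assumed base URL for a server started with:
#   rocketrag server --host 0.0.0.0 --port 8000
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt: str, model: str = "local") -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def ask_server(prompt: str) -> str:
    """POST the payload to a running RocketRAG server and return the answer.

    Requires `rocketrag server` to be up; the endpoint path is assumed.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses nest the text under choices[0].message.
    return body["choices"][0]["message"]["content"]
```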
## Why it Matters
- Optimized for speed end to end (Kreuzberg loaders, Chonkie chunking, SIMD-friendly embeddings, Milvus Lite retrieval).
- End-user friendly with verbose CLI UX plus full FastAPI server, health checks, visualizers, and chat UI.
- Batteries-included configuration, yet fully extensible for custom loaders, chunkers, vectorizers, or LLM runtimes.
- Designed for reproducible deployments: consistent chunking, metadata preservation, and testing via `pytest` + `ruff`.