Simple Retrieval-Augmented Classification
A training-free retrieval-augmented classification framework that can enhance any VLM with minimal resources.
Motivation #
Vision-Language Models (VLMs) have achieved remarkable results in image understanding, but in radiology their static training data often limits accuracy and adaptability. We explored whether adding retrieval mechanisms — not text-based, but image-based — could improve diagnostic performance without retraining models.
Our goal was to build a training-free retrieval-augmented classification framework that can enhance any VLM with minimal resources.
Methodology Overview #
We developed RAD-SRAC (Simple Retrieval-Augmented Classification) — a lightweight, few-shot framework designed to augment classification tasks across radiological modalities (CT, MRI, X-ray).

The pipeline works as follows:
- Vector Database Construction
  - We encoded all dataset images with the MedImageInsight encoder, a domain-specific medical imaging model.
  - Embeddings were stored in Qdrant, an open-source vector database.
  - Each dataset split formed a searchable repository of image features and associated labels.
- Retrieval Process (see the indexing and retrieval sketch after this list)
  - For every query image, we retrieved the top-k (1–10) most similar samples using cosine similarity.
  - Retrieved images were appended to the model's prompt as few-shot examples, either with labels or unlabeled.
- Prompting Setup (see the prompt-assembly sketch after this list)
  - All models used the same structured protocol:
    - System Context: defines the modality and the available classes.
    - Few-Shot Section: injects the retrieved reference images.
    - Classification Request: asks for a JSON output with `predicted_class` and brief reasoning.
  - Example (simplified): "You are a medical expert. Analyze the CT image and classify it into {labels}. Examples: class: [clear_cell_RCC] [image] Now analyze the new image and output JSON: {'y_pred': ..., 'explanation': ...}"
- Evaluation Protocol (see the retry sketch after this list)
  - We compared baseline ("raw") VLM classification with SRAC-augmented versions.
  - Each test image was re-run up to 5 times if the model failed to follow the output structure.
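To make the first two steps concrete, here is a minimal sketch of the indexing and retrieval flow using the qdrant-client Python API. The `encode_image` stand-in, the collection name, the embedding dimension (1024), and the toy `database_split` are assumptions for illustration; the real pipeline uses the MedImageInsight encoder.

```python
import uuid

import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def encode_image(path: str) -> list[float]:
    # Stand-in for the MedImageInsight encoder (its exact interface is assumed);
    # returns a deterministic unit vector so the example runs end to end.
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    v = rng.normal(size=1024)
    return (v / np.linalg.norm(v)).tolist()

client = QdrantClient(":memory:")  # in-process instance, fine for a demo

# Step 1: build the vector database with cosine distance, matching the paper's metric.
client.create_collection(
    collection_name="kits23_db",  # hypothetical collection name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

database_split = [("ct_001.png", "clear_cell_RCC"), ("ct_002.png", "papillary_RCC")]  # toy data
client.upsert(
    collection_name="kits23_db",
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=encode_image(p), payload={"path": p, "label": l})
        for p, l in database_split
    ],
)

# Step 2: retrieve the top-k most similar reference images for a query image.
hits = client.query_points(
    collection_name="kits23_db",
    query=encode_image("query.png"),
    limit=5,  # 3-5 examples worked best in our experiments
).points
few_shot = [(h.payload["path"], h.payload["label"]) for h in hits]
```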
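The structured prompt itself can then be assembled in a standard multimodal chat format. The sketch below uses OpenAI-style message blocks as one plausible serialization; the paper does not specify the exact wire format, so treat the field layout and helper names as assumptions.

```python
import base64

def image_part(path: str) -> dict:
    # OpenAI-style inline image block; other VLM APIs use analogous structures.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_messages(few_shot, query_path, labels):
    """few_shot: list of (path, label) pairs produced by the retrieval step."""
    system = {"role": "system",
              "content": f"You are a medical expert. Classify the CT image into {labels}."}
    content = [{"type": "text", "text": "Examples:"}]
    for path, label in few_shot:
        content.append({"type": "text", "text": f"class: [{label}]"})
        content.append(image_part(path))
    content.append({"type": "text",
                    "text": "Now analyze the new image and output JSON: "
                            "{'y_pred': ..., 'explanation': ...}"})
    content.append(image_part(query_path))
    return [system, {"role": "user", "content": content}]
```

The returned message list plugs into any chat-completions-compatible client; the unlabeled-examples variant simply omits the `class: [...]` text blocks.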
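Finally, the evaluation protocol's retry rule amounts to a small loop around the model call. `call_vlm` is a hypothetical wrapper over whichever API is under test; the parsing is deliberately lenient because the simplified prompt shows single-quoted, Python-dict-style JSON.

```python
import ast
import json

MAX_RETRIES = 5  # protocol: re-run up to 5 times on malformed output

def classify_with_retries(call_vlm, messages, valid_labels):
    """call_vlm(messages) -> raw text; hypothetical wrapper over any VLM API."""
    for _ in range(MAX_RETRIES):
        raw = call_vlm(messages)
        for parse in (json.loads, ast.literal_eval):  # accept strict JSON or dict-style output
            try:
                out = parse(raw)
            except (ValueError, SyntaxError):
                continue
            if isinstance(out, dict) and out.get("y_pred") in valid_labels:
                return out
    return None  # counted as a failure after 5 attempts
```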
Datasets #
| Dataset | Modality | Classes | Samples |
|---|---|---|---|
| KITS23 | CT | 5 tumor subtypes | 424 |
| Coronahack | X-ray | Normal / Bacterial / Viral pneumonia | 5908 |
| Brain Tumor Classification | MRI | No tumor / Glioma / Meningioma / Pituitary | 3264 |
Each dataset was split into:
- Database split for vector storage.
- Test split (100 stratified samples).
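As a concrete illustration of the split, scikit-learn's `train_test_split` can carve out exactly 100 stratified test samples; the toy `paths` and `labels` below stand in for one dataset's file list, since the study's exact split tooling is not specified.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins; in practice these are one dataset's image paths and class labels.
paths = [f"img_{i}.png" for i in range(3264)]
labels = ["no_tumor", "glioma", "meningioma", "pituitary"] * 816

# test_size=100 requests an absolute count; stratify preserves class proportions.
db_paths, test_paths, db_labels, test_labels = train_test_split(
    paths, labels, test_size=100, stratify=labels, random_state=0
)
```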
Models Evaluated #
We tested the SRAC approach on both large state-of-the-art VLMs and smaller deployable models:
- Claude 3.5 Sonnet
- GPT-4o
- Gemini 1.5 Pro
- Qwen2-VL 72B
- Gemini 1.5 Flash-8B
- Pixtral-12B
All models used temperature = 1.
Results #
1. Large-Scale Models #
SRAC substantially improved F1 scores across all datasets:
| Dataset | Model | F1 (Raw) | F1 (SRAC) | Δ (pp) |
|---|---|---|---|---|
| KITS23 | GPT-4o | 57% | 63% | +6 |
| KITS23 | Claude 3.5 | 53% | 61% | +8 |
| Coronahack | GPT-4o | 41% | 76% | +35 |
| Coronahack | Claude 3.5 | 46% | 76% | +30 |
| Brain Tumor | GPT-4o | 59% | 94% | +35 |
| Brain Tumor | Claude 3.5 | 56% | 91% | +35 |
The largest relative gain among the large models was 142%, on Coronahack with Gemini 1.5 Pro, while the hardest dataset (KITS23) showed smaller but consistent improvements.
2. Small Deployable Models #
For smaller, on-premise models:
| Dataset | Model | F1 (Raw) | F1 (SRAC) | Δ (pp) |
|---|---|---|---|---|
| Coronahack | Pixtral-12B | 17% | 58% | +41 |
| Brain Tumor | Pixtral-12B | 23% | 66% | +43 |
| KITS23 | Gemini Flash-8B | 45% | 63% | +18 |
These results show that SRAC can substantially narrow the performance gap between small and large models, which is critical for clinical environments where data must remain local.
3. Optimal Number of Examples #
Performance peaked at 3–5 retrieved images, beyond which gains saturated or declined — indicating diminishing returns and supporting practical efficiency for deployment.
Visual Analysis #
t-SNE projections of MedImageInsight embeddings showed:
- Clear separability for Coronahack and Brain Tumor classes.
- Overlapping clusters for KITS23, explaining its lower SRAC gains due to subtle inter-class differences.
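For readers who want to reproduce the projection, a minimal sketch with scikit-learn's `TSNE` follows; the random matrix stands in for MedImageInsight embeddings and the class assignments are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins: 300 embeddings (assumed dim 1024) with 3 fake classes.
X = np.random.default_rng(0).normal(size=(300, 1024))
y = np.random.default_rng(1).integers(0, 3, size=300)

# Project to 2D and color points by class to inspect cluster separability.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y, s=8, cmap="tab10")
plt.title("t-SNE of image embeddings by class")
plt.show()
```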
Discussion #
Key insights:
- Encoder quality dominates: MedImageInsight effectively captured high-level modality features, but fine-grained CT distinctions remain difficult.
- Retrieval scale is bounded: Excessive examples degrade accuracy, likely due to context dilution.
- Privacy-friendly deployability: SRAC enables effective use of small, local VLMs under healthcare data governance rules.
- Training-free generalization: No retraining or finetuning required — only database construction and prompt adaptation.
Conclusion #
RAD-SRAC demonstrates that retrieval-augmented classification can deliver relative F1 improvements of up to 250% across radiology datasets with no model training. Peak performance at 3–5 retrieved few-shot cases makes it a lightweight, practical enhancement for both large and small VLMs, especially in clinical settings that demand on-premise, privacy-preserving AI.