Simple Retrieval-Augmented Classification
A training-free retrieval-augmented classification framework that can enhance any VLM with minimal resources.
Motivation #
Vision-Language Models (VLMs) have achieved remarkable results in image understanding, but in radiology their static training data often limits accuracy and adaptability. We explored whether adding retrieval mechanisms — not text-based, but image-based — could improve diagnostic performance without retraining models.
Our goal was to build a training-free retrieval-augmented classification framework that can enhance any VLM with minimal resources.
Methodology Overview #
We developed RAD-SRAC (Simple Retrieval-Augmented Classification) — a lightweight, few-shot framework designed to augment classification tasks across radiological modalities (CT, MRI, X-ray).

The pipeline works as follows:
- Vector Database Construction
  - We encoded all dataset images with the MedImageInsight encoder, a domain-specific medical imaging model.
  - Embeddings were stored in Qdrant, an open-source vector database.
  - Each dataset split formed a searchable repository of image features and associated labels.
- Retrieval Process (see the indexing and retrieval sketch after this list)
  - For every query image, we retrieved the top-k (1–10) most similar samples using cosine similarity.
  - Retrieved images were appended to the model's prompt as few-shot examples, either with labels or unlabeled.
- Prompting Setup (see the prompt-assembly sketch after this list)
  - All models used the same structured protocol:
    - System Context: defines the modality and the available classes.
    - Few-Shot Section: injects the retrieved reference images.
    - Classification Request: asks for a JSON output with `predicted_class` and brief reasoning.
  - Example (simplified): "You are a medical expert. Analyze the CT image and classify it into {labels}. Examples: class: [clear_cell_RCC] [image] Now analyze the new image and output JSON: {'y_pred': ..., 'explanation': ...}"
- Evaluation Protocol (see the retry sketch after this list)
  - We compared baseline ("raw") VLM classification with SRAC-augmented versions.
  - Each test image was re-run up to 5 times if the model failed to follow the output structure.
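To make the first two steps concrete, here is a minimal sketch of the indexing and retrieval flow using the qdrant-client Python API. The `encode_image` stand-in, the collection name, the embedding dimension (1024), and the toy `database_split` are assumptions for illustration; the real pipeline uses the MedImageInsight encoder.

```python
import uuid

import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def encode_image(path: str) -> list[float]:
    # Stand-in for the MedImageInsight encoder (its exact interface is assumed);
    # returns a deterministic unit vector so the example runs end to end.
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    v = rng.normal(size=1024)
    return (v / np.linalg.norm(v)).tolist()

client = QdrantClient(":memory:")  # in-process instance, fine for a demo

# Step 1: build the vector database with cosine distance, matching the paper's metric.
client.create_collection(
    collection_name="kits23_db",  # hypothetical collection name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

database_split = [("ct_001.png", "clear_cell_RCC"), ("ct_002.png", "papillary_RCC")]  # toy data
client.upsert(
    collection_name="kits23_db",
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=encode_image(p), payload={"path": p, "label": l})
        for p, l in database_split
    ],
)

# Step 2: retrieve the top-k most similar reference images for a query image.
hits = client.query_points(
    collection_name="kits23_db",
    query=encode_image("query.png"),
    limit=5,  # 3-5 examples worked best in our experiments
).points
few_shot = [(h.payload["path"], h.payload["label"]) for h in hits]
```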
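The structured prompt itself can then be assembled in a standard multimodal chat format. The sketch below uses OpenAI-style message blocks as one plausible serialization; the paper does not specify the exact wire format, so treat the field layout and helper names as assumptions.

```python
import base64

def image_part(path: str) -> dict:
    # OpenAI-style inline image block; other VLM APIs use analogous structures.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_messages(few_shot, query_path, labels):
    """few_shot: list of (path, label) pairs produced by the retrieval step."""
    system = {"role": "system",
              "content": f"You are a medical expert. Classify the CT image into {labels}."}
    content = [{"type": "text", "text": "Examples:"}]
    for path, label in few_shot:
        content.append({"type": "text", "text": f"class: [{label}]"})
        content.append(image_part(path))
    content.append({"type": "text",
                    "text": "Now analyze the new image and output JSON: "
                            "{'y_pred': ..., 'explanation': ...}"})
    content.append(image_part(query_path))
    return [system, {"role": "user", "content": content}]
```

The returned message list plugs into any chat-completions-compatible client; the unlabeled-examples variant simply omits the `class: [...]` text blocks.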
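Finally, the evaluation protocol's retry rule amounts to a small loop around the model call. `call_vlm` is a hypothetical wrapper over whichever API is under test; the parsing is deliberately lenient because the simplified prompt shows single-quoted, Python-dict-style JSON.

```python
import ast
import json

MAX_RETRIES = 5  # protocol: re-run up to 5 times on malformed output

def classify_with_retries(call_vlm, messages, valid_labels):
    """call_vlm(messages) -> raw text; hypothetical wrapper over any VLM API."""
    for _ in range(MAX_RETRIES):
        raw = call_vlm(messages)
        for parse in (json.loads, ast.literal_eval):  # accept strict JSON or dict-style output
            try:
                out = parse(raw)
            except (ValueError, SyntaxError):
                continue
            if isinstance(out, dict) and out.get("y_pred") in valid_labels:
                return out
    return None  # counted as a failure after 5 attempts
```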
Datasets #
| Dataset | Modality | Classes | Samples |
|---|---|---|---|
| KITS23 | CT | 5 tumor subtypes | 424 |
| Coronahack | X-ray | Normal / Bacterial / Viral pneumonia | 5908 |
| Brain Tumor Classification | MRI | No tumor / Glioma / Meningioma / Pituitary | 3264 |
Each dataset was split into:
- Database split for vector storage.
- Test split (100 stratified samples).
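As a concrete illustration of the split, scikit-learn's `train_test_split` can carve out exactly 100 stratified test samples; the toy `paths` and `labels` below stand in for one dataset's file list, since the study's exact split tooling is not specified.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins; in practice these are one dataset's image paths and class labels.
paths = [f"img_{i}.png" for i in range(3264)]
labels = ["no_tumor", "glioma", "meningioma", "pituitary"] * 816

# test_size=100 requests an absolute count; stratify preserves class proportions.
db_paths, test_paths, db_labels, test_labels = train_test_split(
    paths, labels, test_size=100, stratify=labels, random_state=0
)
```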
Models Evaluated #
We tested the SRAC approach on both large state-of-the-art VLMs and smaller deployable models:
- Claude 3.5 Sonnet
- GPT-4o
- Gemini 1.5 Pro
- Qwen2-VL 72B
- Gemini 1.5 Flash-8B
- Pixtral-12B
All models used temperature = 1.
Results #
1. Large-Scale Models #
SRAC substantially improved F1 scores across all datasets:
| Dataset | Model | F1 (Raw) | F1 (SRAC) | Δ (pp) |
|---|---|---|---|---|
| KITS23 | GPT-4o | 57% | 63% | +6 |
| KITS23 | Claude 3.5 | 53% | 61% | +8 |
| Coronahack | GPT-4o | 41% | 76% | +35 |
| Coronahack | Claude 3.5 | 46% | 76% | +30 |
| Brain Tumor | GPT-4o | 59% | 94% | +35 |
| Brain Tumor | Claude 3.5 | 56% | 91% | +35 |
The largest relative gain among the large models was 142%, on Coronahack with Gemini 1.5 Pro, while the hardest dataset (KITS23) showed smaller but consistent improvements.
2. Small Deployable Models #
For smaller, on-premise models:
| Dataset | Model | F1 (Raw) | F1 (SRAC) | Δ (pp) |
|---|---|---|---|---|
| Coronahack | Pixtral-12B | 17% | 58% | +41 |
| Brain Tumor | Pixtral-12B | 23% | 66% | +43 |
| KITS23 | Gemini Flash-8B | 45% | 63% | +18 |
These results show that SRAC can substantially narrow the performance gap between small and large models, which is critical for clinical environments where data must remain local.
3. Optimal Number of Examples #
Performance peaked at 3–5 retrieved images, beyond which gains saturated or declined — indicating diminishing returns and supporting practical efficiency for deployment.
Visual Analysis #
t-SNE projections of MedImageInsight embeddings showed:
- Clear separability for Coronahack and Brain Tumor classes.
- Overlapping clusters for KITS23, explaining its lower SRAC gains due to subtle inter-class differences.
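For readers who want to reproduce the projection, a minimal sketch with scikit-learn's `TSNE` follows; the random matrix stands in for MedImageInsight embeddings and the class assignments are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins: 300 embeddings (assumed dim 1024) with 3 fake classes.
X = np.random.default_rng(0).normal(size=(300, 1024))
y = np.random.default_rng(1).integers(0, 3, size=300)

# Project to 2D and color points by class to inspect cluster separability.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y, s=8, cmap="tab10")
plt.title("t-SNE of image embeddings by class")
plt.show()
```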
Discussion #
Key insights:
- Encoder quality dominates: MedImageInsight effectively captured high-level modality features, but fine-grained CT distinctions remain difficult.
- Retrieval scale is bounded: Excessive examples degrade accuracy, likely due to context dilution.
- Privacy-friendly deployability: SRAC enables effective use of small, local VLMs under healthcare data governance rules.
- Training-free generalization: No retraining or finetuning required — only database construction and prompt adaptation.
Conclusion #
RAD-SRAC demonstrates that retrieval-augmented classification can deliver relative F1 improvements of up to 250% across radiology datasets with no model training. Peak performance at 3–5 retrieved few-shot cases makes it a lightweight, practical enhancement for both large and small VLMs, especially in clinical settings that demand on-premise, privacy-preserving AI.