LOMA: Mobile Offline Medical AI Assistant
Motivation #
Billions of people need reliable medical answers where connectivity is limited or privacy is paramount.
LOMA (Local Offline Medical Assistant) delivers a zero-cloud experience: the entire assistant, from embeddings to language model responses, runs on the user’s phone.
System Design #
- Model – Gemma 3n converted to a 4.79 GB GGUF checkpoint, served via `llama.rn` with GPU offload for up to 99 layers (see the loading sketch after this list).
- Retrieval – A doc2query-enhanced RAG stack indexes 5 million Q&A-style medical documents so answers are grounded and cite exact sources.
- Embeddings – ExecuTorch runs `all-MiniLM-L6-v2` locally, generating 384-d vectors in ~70 ms while using only 150–190 MB of RAM.
- Database – Turso (SQLite + vector extensions) ships as a pre-built bundle synced through Cloudflare R2; a cosine search over the full store completes in ~45 s without ballooning storage (see the query sketch after this list).
- Frontend – React Native application with shared abstractions for storage, queue-based inference, and lazy loading to keep both iOS and Android responsive.
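A minimal sketch of the model-loading step, assuming `llama.rn`'s `initLlama` API; the model path, context size, and option values shown here are illustrative, not LOMA's exact configuration.

```typescript
// Sketch: loading the Gemma 3n GGUF checkpoint with llama.rn.
// Path and context size are assumptions for illustration.
import { initLlama, LlamaContext } from 'llama.rn';

export async function loadGemma(modelPath: string): Promise<LlamaContext> {
  // n_gpu_layers: 99 asks llama.cpp to offload as many layers as the
  // device GPU can hold; layers that don't fit fall back to the CPU.
  return initLlama({
    model: modelPath,   // e.g. the 4.79 GB gemma-3n .gguf in app storage
    n_ctx: 4096,        // context window (assumption)
    n_gpu_layers: 99,   // offload up to 99 layers to the GPU
  });
}
```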
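And a sketch of the cosine search against the bundled database, assuming libSQL's `vector32()` / `vector_distance_cos()` SQL functions; the `@libsql/client` call stands in for whatever SQLite binding the React Native app actually uses, and the table and column names are illustrative.

```typescript
// Sketch: brute-force cosine search over the pre-built libSQL bundle.
// Table/column names (documents, embedding, body, source) are assumptions.
import { createClient } from '@libsql/client';

const db = createClient({ url: 'file:loma.db' }); // local bundle synced via R2

export async function searchDocuments(queryEmbedding: number[], k = 5) {
  const result = await db.execute({
    // vector32() parses a JSON array into a 32-bit float vector;
    // vector_distance_cos() returns cosine distance (lower = more similar).
    sql: `SELECT id, body, source,
                 vector_distance_cos(embedding, vector32(?)) AS dist
          FROM documents
          ORDER BY dist ASC
          LIMIT ?`,
    args: [JSON.stringify(queryEmbedding), k],
  });
  return result.rows;
}
```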
Workflow #
- A user question is normalized into Gemma’s conversation format.
- The query embedding searches both long-form documents and FAQ-style pairs.
- Retrieved passages are assembled with structured citations.
- Gemma 3n generates an answer entirely on-device, never sending data to external servers (a condensed sketch of these steps follows).
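A condensed sketch of steps 1, 3, and 4, assuming `llama.rn`'s chat-style `completion` call; the passage shape, citation format, and sampling parameters are assumptions, and retrieval (step 2) is treated as already done, e.g. by a search like the one sketched above.

```typescript
// Sketch: assemble retrieved passages with citations and generate on-device.
// Passage shape, prompt wording, and sampling values are assumptions.
import type { LlamaContext } from 'llama.rn';

interface Passage { body: string; source: string }

export async function answer(ctx: LlamaContext, question: string, passages: Passage[]) {
  // Step 3: number the passages so the model can cite them as [1], [2], ...
  const context = passages
    .map((p, i) => `[${i + 1}] (${p.source}) ${p.body}`)
    .join('\n\n');

  // Steps 1 & 4: wrap everything in Gemma's conversation format and run locally.
  const result = await ctx.completion({
    messages: [
      {
        role: 'user',
        content:
          `Answer the question using only the passages below and cite them as [n].\n\n` +
          `${context}\n\nQuestion: ${question}`,
      },
    ],
    n_predict: 512,    // cap the answer length (assumption)
    temperature: 0.2,  // keep medical answers conservative (assumption)
  });
  return result.text;
}
```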
Impact & Metrics #
- Privacy-preserving responses with verifiable citations improve trust for clinical decision support.
- Works offline after the initial 4.79 GB download; model + DB fit comfortably on mid-range phones.
- Vector search takes 94 ms over 50k vectors and ~45 s over the full 5M-document store, an acceptable tradeoff for a 250 MB footprint.
- Response latency stays under one minute even on modest hardware, broadening device eligibility.