LOMA: Mobile Offline Medical AI Assistant

Motivation

Billions of people need reliable medical answers where connectivity is limited or privacy is paramount.
LOMA (Local Offline Medical Assistant) delivers a zero-cloud experience: the entire assistant, from embeddings to language model responses, runs on the user’s phone.

System Design

  • Model – Gemma 3n converted to a 4.79 GB GGUF checkpoint, served via llama.rn with GPU offload for up to 99 layers (see the loading sketch after this list).
  • Retrieval – A doc2query-enhanced RAG stack indexes 5 million Q&A style medical documents so answers are grounded and cite exact sources.
  • Embeddings – ExecuTorch runs all-MiniLM-L6-v2 locally, generating 384-d vectors in ~70 ms using only 150–190 MB RAM.
  • Database – Turso (SQLite + vector extensions) ships as a pre-built bundle synced through Cloudflare R2; cosine-similarity search over the full corpus returns results in roughly 45 s without ballooning storage (see the search sketch after this list).
  • Frontend – React Native application with shared abstractions for storage, queue-based inference, and lazy loading to keep both iOS and Android responsive.
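
On-device inference goes through llama.rn. The sketch below shows roughly how the model could be initialized and queried; the file path, context size, and response cap are illustrative assumptions, and only the n_gpu_layers value mirrors the configuration described above.

```typescript
import { initLlama } from 'llama.rn';

// Hypothetical on-device path for the downloaded GGUF checkpoint.
const MODEL_PATH = '/data/local/models/gemma-3n.gguf';

let context: Awaited<ReturnType<typeof initLlama>> | undefined;

// Load the model once and reuse the context across requests.
export async function loadModel() {
  if (!context) {
    context = await initLlama({
      model: MODEL_PATH,
      n_ctx: 4096,      // assumed context window
      n_gpu_layers: 99, // offload up to 99 layers to the GPU, as noted above
    });
  }
  return context;
}

// Run a single completion against the local model.
export async function generate(prompt: string): Promise<string> {
  const ctx = await loadModel();
  const { text } = await ctx.completion({
    messages: [{ role: 'user', content: prompt }],
    n_predict: 512, // assumed cap on response length
  });
  return text;
}
```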
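The retrieval timings reported below come from cosine similarity over stored embeddings. As a library-agnostic sketch, brute-force scoring over 384-dimensional vectors looks like this in TypeScript; the row shape is an assumption standing in for the Turso-backed store.

```typescript
// A stored document row; the shape is hypothetical, for illustration only.
interface DocRow {
  id: number;
  text: string;
  source: string;
  embedding: Float32Array; // 384-d vector from all-MiniLM-L6-v2
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every row and keep the k best matches.
function topK(query: Float32Array, rows: DocRow[], k = 5): DocRow[] {
  return rows
    .map((row) => ({ row, score: cosineSimilarity(query, row.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ row }) => row);
}
```

A linear scan like this grows with the number of stored vectors, which is consistent with the reported figures: about 94 ms over 50k vectors versus roughly 45 s over the full 5M-document store.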

Workflow

  1. A user question is normalized into Gemma’s conversation format.
  2. The query is embedded locally, and the embedding is matched against both long-form documents and FAQ-style pairs.
  3. Retrieved passages are assembled with structured citations.
  4. Gemma 3n generates an answer entirely on-device, never sharing data with servers (the full pipeline is sketched below).
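
Taken together, the steps above amount to a short retrieval-augmented pipeline. The sketch below ties them into one function; embedQuery, searchDocuments, and generateAnswer are hypothetical helpers standing in for the ExecuTorch, Turso, and llama.rn layers, and the citation format is illustrative.

```typescript
// Hypothetical helpers for the embedding, retrieval, and generation layers.
declare function embedQuery(text: string): Promise<Float32Array>;
declare function searchDocuments(vector: Float32Array, k: number): Promise<Passage[]>;
declare function generateAnswer(prompt: string): Promise<string>;

// A retrieved passage with the source it should be cited under.
interface Passage {
  source: string;
  text: string;
}

// Answer a medical question entirely on-device.
async function askLoma(question: string): Promise<string> {
  // 1. Normalize the question and embed it locally.
  const queryVector = await embedQuery(question.trim());

  // 2. Retrieve the closest passages from the local store.
  const passages = await searchDocuments(queryVector, 5);

  // 3. Assemble the prompt with numbered, structured citations.
  const contextBlock = passages
    .map((p, i) => `[${i + 1}] (${p.source}) ${p.text}`)
    .join('\n');
  const prompt =
    `Answer using only the sources below and cite them by number.\n\n` +
    `${contextBlock}\n\nQuestion: ${question}`;

  // 4. Generate the answer with the local Gemma 3n model.
  return generateAnswer(prompt);
}
```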

Impact & Metrics

  • Privacy-preserving responses with verifiable citations improve trust for clinical decision support.
  • Works offline after the initial 4.79 GB download; model + DB fit comfortably on mid-range phones.
  • Vector search: 94 ms for 50k vectors and ~45 s for the full 5M-document store, an acceptable tradeoff for a 250 MB footprint.
  • Response latency stays under one minute even on modest hardware, broadening device eligibility.

🔗 Read the full build log