UMIE Datasets: Unified Medical Imaging Ecosystem

UMIE standardizes 882,774 images from more than 20 open medical datasets covering CT, MRI, and X-ray modalities.
Pipelines download, clean, and annotate each source, outputting .png assets with harmonized metadata, segmentation masks, and RadLex-compliant labels.
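To make the idea of harmonized labels concrete, here is a minimal sketch of mapping source-specific labels from two datasets onto one shared vocabulary. The dataset keys, label strings, and the `harmonize` helpers are hypothetical placeholders for illustration; they are not taken from UMIE's code or from RadLex itself.

```python
# Hypothetical per-dataset label tables. The dataset keys and label strings
# are illustrative only and are not copied from UMIE or RadLex.
SOURCE_TO_SHARED = {
    "dataset_a": {"PNEUMONIA": "Pneumonia", "NORMAL": "Normal"},
    "dataset_b": {"pneumonia": "Pneumonia", "no finding": "Normal"},
}


def harmonize(dataset: str, source_label: str) -> str:
    """Return the shared label, failing loudly on anything unmapped."""
    try:
        return SOURCE_TO_SHARED[dataset][source_label]
    except KeyError as exc:
        raise ValueError(
            f"no shared label for {dataset!r}/{source_label!r}"
        ) from exc


def harmonize_records(records: list[dict]) -> list[dict]:
    """Copy each record, replacing its source label with the shared one."""
    return [
        {**record, "label": harmonize(record["dataset"], record["label"])}
        for record in records
    ]


mixed = [
    {"image": "a_0001.png", "dataset": "dataset_a", "label": "NORMAL"},
    {"image": "b_0001.png", "dataset": "dataset_b", "label": "no finding"},
]
unified = harmonize_records(mixed)
print([record["label"] for record in unified])
```

Raising on an unmapped label, rather than passing it through, is one way to keep a new dataset from silently introducing a label outside the shared vocabulary.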

Pipeline Design

  • Modular, scikit-learn-style steps transform DICOM, NIfTI, TIFF, and JPG sources into a consistent structure.
  • Reusable components handle spacing normalization, mask extraction, ontology remapping, and file-tree creation.
  • Adding a new dataset typically means composing existing steps and configuring paths inside config/runner_config.py.
  • A shared RadLex ontology keeps labels consistent across sources, so datasets can be combined for training without per-dataset label mapping.
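The step-composition idea above can be sketched in plain Python. This is an illustration of the scikit-learn-style `transform` convention only, not UMIE's actual classes: `ImageRecord`, `NormalizeSpacing`, `UsePngExtension`, and `run_pipeline` are hypothetical names, and the steps operate on metadata rather than on real pixel data.

```python
from dataclasses import dataclass, replace
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical stand-in for one image and its metadata."""
    path: str
    shape: tuple[int, int]            # (rows, cols) in pixels
    spacing_mm: tuple[float, float]   # physical size of one pixel


class NormalizeSpacing:
    """Compute the grid size each image would have at one target spacing."""

    def __init__(self, target_mm: float = 1.0) -> None:
        self.target_mm = target_mm

    def transform(self, records: list[ImageRecord]) -> list[ImageRecord]:
        out = []
        for r in records:
            # Physical extent stays fixed: pixels * spacing = new_pixels * target.
            new_shape = tuple(
                max(1, round(n * s / self.target_mm))
                for n, s in zip(r.shape, r.spacing_mm)
            )
            out.append(
                replace(r, shape=new_shape,
                        spacing_mm=(self.target_mm, self.target_mm))
            )
        return out


class UsePngExtension:
    """Rewrite the output path to end in .png (metadata only)."""

    def transform(self, records: list[ImageRecord]) -> list[ImageRecord]:
        return [
            replace(r, path=str(PurePosixPath(r.path).with_suffix(".png")))
            for r in records
        ]


def run_pipeline(steps, records: list[ImageRecord]) -> list[ImageRecord]:
    """Apply each step's transform in order, feeding outputs forward."""
    for step in steps:
        records = step.transform(records)
    return records


records = [ImageRecord("scan_001.dcm", (256, 512), (0.5, 0.5))]
result = run_pipeline(
    [NormalizeSpacing(target_mm=1.0), UsePngExtension()], records
)
print(result[0])
```

Because every step exposes the same `transform` method, supporting another dataset in this sketch amounts to choosing a different list of steps rather than writing a new loop.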

Dataset Coverage (excerpt)

Covered datasets include KITS-23, Coronahack, Brain Tumor MRI (multiple versions), ChestX-ray14, COCA CT calcium, BrainMetShare, CT-ORG, LIDC-IDRI, and CMMD, among others. Several are segmentation-ready corpora with paired masks.

Tooling & Ops

  • Poetry manages dependencies, pre-commit enforces formatting, and GitHub Actions runs the test matrix.
  • Contributors can dry-run checks with pre-commit run --all-files and execute run_tests.sh for integration coverage.
  • The roadmap targets Hugging Face dataset exports and data dashboards that make curation status visible.

Why it Matters

  • Gives researchers a turnkey way to assemble large, diverse medical imaging corpora without redoing preprocessing.
  • Reduces ontology mismatches when mixing classification and segmentation tasks.
  • Encourages reproducibility by scripting every download, conversion, and metadata step in the open.

The GitHub repository is linked through the card above.