UMIE Datasets: Unified Medical Imaging Ecosystem
UMIE standardizes 882,774 images across 20+ open medical datasets covering CT, MRI, and X-ray modalities.
Pipelines download, clean, and annotate each source, outputting .png assets with harmonized metadata, segmentation masks, and RadLex-compliant labels.
Pipeline Design
- Modular, scikit-learn-style steps transform DICOM, NIfTI, TIFF, and JPG sources into a consistent structure.
- Reusable components handle spacing normalization, mask extraction, ontology remapping, and file-tree creation.
- Adding a new dataset typically means composing existing steps and configuring paths inside config/runner_config.py.
- Shared RadLex ontology eliminates label drift and enables frictionless multi-dataset training.
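To make the composition idea above concrete, here is a minimal sketch of how scikit-learn-style steps (objects exposing fit and transform) might be chained for one dataset. It is not the project's actual code: ImageRecord, RemapLabels, AssignOutputPath, the small Pipeline runner, the dataset name, and the label strings are all hypothetical names chosen for illustration, and the target label is a placeholder rather than a real RadLex term.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, List, Protocol, Tuple


@dataclass
class ImageRecord:
    """Hypothetical record for one image moving through the pipeline."""
    source_path: str
    label: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Step(Protocol):
    """scikit-learn-style interface: fit returns self, transform returns new data."""

    def fit(self, records: List[ImageRecord]) -> "Step": ...

    def transform(self, records: List[ImageRecord]) -> List[ImageRecord]: ...


class RemapLabels:
    """Map dataset-specific label strings onto one shared vocabulary."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def fit(self, records: List[ImageRecord]) -> "RemapLabels":
        return self  # stateless step, nothing to learn

    def transform(self, records: List[ImageRecord]) -> List[ImageRecord]:
        # Unknown labels pass through unchanged; inputs are not mutated.
        return [
            ImageRecord(r.source_path, self.mapping.get(r.label, r.label), dict(r.metadata))
            for r in records
        ]


class AssignOutputPath:
    """Record a .png target path under a common <root>/<dataset>/ file tree."""

    def __init__(self, dataset_name: str, output_root: str) -> None:
        self.dataset_name = dataset_name
        self.output_root = output_root

    def fit(self, records: List[ImageRecord]) -> "AssignOutputPath":
        return self

    def transform(self, records: List[ImageRecord]) -> List[ImageRecord]:
        out = []
        for r in records:
            stem = PurePosixPath(r.source_path).stem
            target = PurePosixPath(self.output_root) / self.dataset_name / f"{stem}.png"
            out.append(ImageRecord(r.source_path, r.label, {**r.metadata, "output_path": str(target)}))
        return out


class Pipeline:
    """Tiny stand-in for a scikit-learn Pipeline: runs named steps in order."""

    def __init__(self, steps: List[Tuple[str, Step]]) -> None:
        self.steps = steps

    def fit_transform(self, records: List[ImageRecord]) -> List[ImageRecord]:
        for _name, step in self.steps:
            records = step.fit(records).transform(records)
        return records


# Adding a "new dataset" here only means composing existing steps with new settings.
records = [ImageRecord("raw/scan_001.dcm", "kidney tumour")]
pipeline = Pipeline([
    ("remap", RemapLabels({"kidney tumour": "shared_neoplasm"})),  # placeholder label
    ("paths", AssignOutputPath("example_dataset", "out")),
])
result = pipeline.fit_transform(records)
```

Because each step returns fresh records and shares the same two-method interface, steps can be reordered or reused across datasets without touching one another, which is the property the list above relies on.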
Dataset Coverage (excerpt)
KITS-23, Coronahack, Brain Tumor MRI (multiple versions), ChestX-ray14, COCA CT calcium, BrainMetShare, CT-ORG, LIDC-IDRI, CMMD, and many more, including segmentation-ready corpora with paired masks.
Tooling & Ops
- poetry manages dependencies, pre-commit enforces formatting, and GitHub Actions run the test matrix.
- Contributors can dry-run checks with pre-commit run --all-files and execute run_tests.sh for integration coverage.
- Roadmap targets HuggingFace dataset exports and rich data dashboards for curation visibility.
Why it Matters
- Gives researchers a turnkey way to assemble large, diverse medical imaging corpora without redoing preprocessing.
- Reduces ontology mismatches when mixing classification and segmentation tasks.
- Encourages reproducibility by scripting every download, conversion, and metadata step in the open.
GitHub repository linked through the card above.