Target-Free Domain Adaptation through Cross-Adaptation


GitHub: TheLion-ai/cross-adaptation (Jupyter Notebook)

In this work, we introduce cross-adaptation, a new approach to domain adaptation that enables machine learning models to generalize across datasets from different sources without requiring any data, labeled or not, from the target domain. We validate the method on nine datasets for SARS-CoV-2 detection from complete blood count (CBC) data collected at hospitals in five countries.

Motivation

Machine learning models in healthcare often fail to generalize beyond the data source they were trained on. Even for the same clinical task, datasets from different hospitals can have distinct population characteristics. This dataset bias severely limits the clinical deployment of AI models.

Existing domain adaptation techniques typically assume access to some data from the target domain, often with labels. However, in real-world clinical scenarios, labeled target data may be unavailable due to privacy, regulatory, or logistical constraints. Our goal was to develop a target-free adaptation method that could overcome these limitations.

Method: Cross-Adaptation

The proposed approach, cross-adaptation, performs domain adaptation iteratively across multiple available datasets. It generalizes across all source domains without direct access to the target one.

Formally, let there be $n$ datasets $D = \{D_1, D_2, ..., D_n\}$, each representing a different hospital or population. For each $D_i$:

Treat $D_i$ as the source dataset ($D_s$).
Treat all remaining datasets as target datasets ($D_t = D \setminus \{D_i\}$).
Apply a chosen domain adaptation algorithm ($g$) to transform $D_s$ relative to $D_t$: $$ D_{x_i} = g(D_s, D_t) $$
Add the transformed dataset $D_{x_i}$ to a new collection $D_x$.

After all iterations, the transformed datasets $D_x = \{D_{x_1}, D_{x_2}, ..., D_{x_n}\}$ are concatenated and used to train a final model $f$. This model can then be applied to unseen domains.

The procedure is algorithm-agnostic: any domain adaptation method can serve as $g$ (e.g., KMM, KLIEP, or TrAdaBoost).
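The loop described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the code from the repository: the `Dataset` layout, the function names, and in particular `mean_shift_adapter` are assumptions made for the example. The adapter is a simple mean shift chosen only to keep the sketch runnable with the standard library; it is not KMM, KLIEP, or TrAdaBoost, and any callable with the same signature could take its place as $g$.

```python
from statistics import fmean
from typing import Callable, List, Sequence, Tuple

Row = List[float]
# A dataset is (feature rows, labels); this layout is an assumption of the sketch.
Dataset = Tuple[List[Row], List[int]]
Adapter = Callable[[Dataset, Sequence[Dataset]], Dataset]


def column_means(rows: Sequence[Row]) -> List[float]:
    """Per-feature mean over a list of rows."""
    return [fmean(col) for col in zip(*rows)]


def mean_shift_adapter(source: Dataset, targets: Sequence[Dataset]) -> Dataset:
    """Hypothetical stand-in for g: shift the source features so that their
    means match the pooled means of the target datasets.
    Target labels are never read."""
    src_rows, src_labels = source
    pooled = [row for rows, _ in targets for row in rows]
    src_mu, tgt_mu = column_means(src_rows), column_means(pooled)
    shifted = [
        [x - s + t for x, s, t in zip(row, src_mu, tgt_mu)]
        for row in src_rows
    ]
    return shifted, list(src_labels)


def cross_adapt(datasets: Sequence[Dataset], g: Adapter) -> Dataset:
    """Adapt each dataset toward all the others, then concatenate the results."""
    all_rows: List[Row] = []
    all_labels: List[int] = []
    for i, source in enumerate(datasets):
        # D_t = D \ {D_i}: every dataset except the current source
        targets = [d for j, d in enumerate(datasets) if j != i]
        rows, labels = g(source, targets)  # D_{x_i} = g(D_s, D_t)
        all_rows.extend(rows)
        all_labels.extend(labels)
    return all_rows, all_labels


if __name__ == "__main__":
    # Three toy single-feature "hospitals" (made-up numbers, for illustration only)
    toy = [
        ([[0.0], [2.0]], [0, 1]),
        ([[10.0], [12.0]], [1, 0]),
        ([[20.0], [22.0]], [0, 0]),
    ]
    X, y = cross_adapt(toy, mean_shift_adapter)
    print(X)
    print(y)
```

The concatenated output of `cross_adapt` plays the role of $D_x$ and would be passed to whatever classifier is chosen as the final model $f$.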

Experimental Setup

We evaluated cross-adaptation on the task of COVID-19 detection using Complete Blood Count (CBC) tests. Nine datasets were used in total:

8 publicly available datasets
1 private dataset (Cabitza et al., 2021)

The combined dataset contained 4,870 CBC records from hospitals located in Brazil, Italy, Poland, Ethiopia, and Spain. Ten features were selected as model inputs, nine CBC parameters plus patient sex:

white blood cells, hemoglobin, mean corpuscular volume, mean corpuscular hemoglobin concentration, platelets, monocytes %, basophils %, lymphocytes %, eosinophils %, and sex.

The target variable was SARS-CoV-2 infection status confirmed by RT-PCR.

Models were trained only on source data, and the validation and test datasets were never transformed. In each transformation scenario, cross-adaptation was used to generate the transformed training sets.

Algorithms Tested

We evaluated five machine learning models:

k-Nearest Neighbors (KNN)
Decision Tree
Random Forest
XGBoost
Multilayer Perceptron (MLP)

Each model was trained with each of three domain adaptation techniques:

Kernel Mean Matching (KMM)
Kullback–Leibler Importance Estimation Procedure (KLIEP)
Transfer AdaBoost (TrAdaBoost)

Hyperparameter tuning was applied in both baseline and adapted settings.

Results

Model	Baseline F1	Cross-Adaptation F1
KNN	55%	73%
Decision Tree	65%	73%
Random Forest	62%	70%
XGBoost	62%	73%
MLP	65%	72%

On average, cross-adaptation improved F1 scores by approximately 10 percentage points across all models. F1 rose for every classifier tested, by 7 to 18 points, which suggests that the method does not depend on a particular classifier or domain adaptation algorithm.
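The average can be checked directly from the table. The short snippet below only redoes that arithmetic; the scores are copied from the table above, and the variable names are illustrative.

```python
# F1 scores in percent, copied from the results table: (baseline, cross-adaptation)
f1_scores = {
    "KNN": (55, 73),
    "Decision Tree": (65, 73),
    "Random Forest": (62, 70),
    "XGBoost": (62, 73),
    "MLP": (65, 72),
}

# Per-model gain in percentage points
gains = {
    model: adapted - baseline
    for model, (baseline, adapted) in f1_scores.items()
}
mean_gain = sum(gains.values()) / len(gains)

print(gains)      # {'KNN': 18, 'Decision Tree': 8, 'Random Forest': 8, 'XGBoost': 11, 'MLP': 7}
print(mean_gain)  # 10.4
```

The largest gain is for KNN, which also had the lowest baseline, and the smallest is for MLP.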

Conclusion and Future Work

Cross-adaptation enables target-free domain adaptation across multiple source domains, making it particularly suitable for global healthcare AI systems trained on heterogeneous, multi-institutional data. Future research will extend this framework to other modalities (e.g., imaging, genomics) and tasks beyond COVID-19 detection.

Code is accessible via the GitHub card above.
