Machine-Aided Detection of SARS-CoV-2 from Complete Blood Count
The study investigated whether routine complete blood count (CBC) tests can be used to detect SARS-CoV-2 infection using machine learning (ML). The motivation was to create a cost-effective, scalable screening tool that could complement RT-PCR testing or preselect patients for it.
Data Sources #
Two datasets were used:
- UCC Dataset (Poland) – 22,463 patient records (2019–2020) from the University Clinical Centre in Gdańsk, with CBC and RT-PCR test results.
- Zenodo Dataset (Italy) – 1,624 patients (2020–2021) from San Raffaele Hospital, provided by Cabitza et al. Only the CBC subset was used.
COVID-19 status was determined by RT-PCR results. Ambiguous or borderline cases were excluded.
Preprocessing #
- Only CBC features common to both datasets were selected.
- Highly correlated feature pairs (correlation > 0.5) were pruned to avoid redundancy.
- Missing continuous values were imputed using k-nearest neighbors (k-NN).
- Features were standardized to zero mean and unit variance.
- Demographic variables (age, sex) were included as additional inputs.
- SMOTE (Synthetic Minority Oversampling Technique) was used to address class imbalance for some models.
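Two of the steps above, correlation pruning and standardization, can be sketched in plain Python. This is a minimal illustration, not the study's code: the greedy rule for which member of a correlated pair to drop, the toy values, and the `HCT` column (not among the study's selected features) are all assumptions made for the example. In practice a library such as scikit-learn would normally handle imputation and scaling.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_correlated(columns, threshold=0.5):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    `columns` maps feature name -> list of values. Dropping the later
    member of a correlated pair is an assumption of this sketch.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values] if std else [0.0] * n

# Toy values invented for illustration; they come from neither dataset.
toy = {
    "HGB": [13.1, 14.2, 12.5, 15.0, 13.8],
    "HCT": [39.0, 42.5, 37.4, 45.1, 41.3],  # hypothetical column that tracks HGB closely
    "PLT": [250.0, 180.0, 210.0, 240.0, 190.0],
}
kept = prune_correlated(toy)
scaled = {name: standardize(toy[name]) for name in kept}
print(kept)
```

Here `HCT` is dropped because it is almost perfectly correlated with `HGB`, while `PLT` survives because its correlation with `HGB` is close to zero.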
Selected Features #
Nine CBC parameters were used:
HGB, MCV, MCHC, PLT, WBC, BA, LY, MO, EO, plus age and sex.
Models #
Four machine learning models were trained and compared:
- XGBoost – Gradient-boosted decision trees with tuned hyperparameters.
- CatBoost – Categorical gradient boosting using cross-entropy loss.
- Fully Connected ANN – 5 dense layers with dropout and sparse categorical cross-entropy loss.
- TabNet – Attention-based deep neural network optimized for tabular data.
TabNet hyperparameters: Adam optimizer, learning rate 0.01, 1000 epochs (early stopping with patience 60), batch size 256/64, weighted categorical cross-entropy loss.
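To illustrate what a weighted categorical cross-entropy does, here is a minimal sketch in plain Python. The inverse-frequency weighting and the normalization by the summed weights are assumptions chosen for the example; the weights actually used for TabNet are not stated here.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of per-sample cross-entropy.

    probs: predicted class probabilities per sample, e.g. [[p_neg, p_pos], ...]
    labels: integer class index per sample
    class_weights: one weight per class; a larger weight makes errors
        on that class cost more
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(max(p[y], eps))
        weight_sum += w
    return total / weight_sum

def inverse_frequency_weights(labels, n_classes=2):
    """Assumed weighting scheme: n_samples / (n_classes * class count).

    Every class must occur at least once in `labels`.
    """
    n = len(labels)
    return [n / (n_classes * labels.count(c)) for c in range(n_classes)]

# Toy batch: four negatives and one positive (invented for illustration).
labels = [0, 0, 0, 0, 1]
weights = inverse_frequency_weights(labels)

# Same confidence everywhere, but the single error falls on a different class.
miss_positive = [[0.9, 0.1]] * 5
miss_negative = [[0.1, 0.9]] + [[0.9, 0.1]] * 3 + [[0.1, 0.9]]

loss_miss_positive = weighted_cross_entropy(miss_positive, labels, weights)
loss_miss_negative = weighted_cross_entropy(miss_negative, labels, weights)
print(weights, loss_miss_positive, loss_miss_negative)
```

With these weights, missing the lone positive sample is penalized more heavily than missing one of the four negatives, which is the behaviour a weighted loss is meant to provide on skewed data.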
Experimental Setup #
- Each dataset was split into 80% training and 20% validation.
- Performance was evaluated using Accuracy, Precision, Recall, AUC, and F1-score.
- Experiments tested:
  - Baseline training (UCC vs Zenodo separately)
  - Effect of balancing data
  - Knowledge transfer (joint training on both datasets)
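An 80/20 split of the kind described above can be sketched as follows. Stratifying by class, so that the rare positive cases appear in both parts, is an assumption of this sketch; the text does not say whether the split was stratified. The function name and the toy labels are hypothetical.

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, val_idx), splitting each class at train_fraction."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels with 10% positives (invented for illustration).
labels = [1] * 10 + [0] * 90
train_idx, val_idx = stratified_split(labels)
val_positives = sum(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), val_positives)
```

Splitting per class keeps the positive rate the same in the training and validation parts, which matters when one class is rare.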
Results #
1. Baseline (Unbalanced Data) #
| Model | Dataset | Best Metric | F1-score |
|---|---|---|---|
| TabNet | UCC | Overall best (Acc=81.8%, Sens=82.2%) | 15.9%* |
| CatBoost | Zenodo | Balanced performance | 74.7% |
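(*Note: the low F1-score on the unbalanced UCC data reflects class skew. With few positive cases, precision stays low even when accuracy and sensitivity are high, and F1 is the harmonic mean of precision and sensitivity.)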
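The effect described in the note can be reproduced with a small worked example. The confusion-matrix counts below are invented purely to show the mechanism (a 2% positive rate with 82% sensitivity and 82% specificity); they are not the study's counts, which are not given here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity) and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical skewed cohort: 200 positives, 9,800 negatives.
skewed = classification_metrics(tp=164, fp=1764, fn=36, tn=8036)

# Hypothetical balanced cohort with the same sensitivity and specificity.
balanced = classification_metrics(tp=164, fp=36, fn=36, tn=164)

for name, m in (("skewed", skewed), ("balanced", balanced)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both cohorts give 82% accuracy and 82% recall, but in the skewed cohort the false positives outnumber the true positives roughly ten to one, so precision falls below 10% and F1 drops to about 0.15. In the balanced cohort the same classifier reaches an F1 of 0.82.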
2. Balanced Dataset (UCC) #
Balancing the UCC data improved results substantially:
| Model | F1-score | AUC |
|---|---|---|
| XGBoost | 66.3% | 69.9% |
| CatBoost | 70.5% | 72.4% |
| ANN | 72.0% | 74.0% |
| TabNet | 87.4% | 87.1% |
Balancing notably increased specificity and overall F1. TabNet achieved the highest metrics across all categories.
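The core idea of SMOTE, which was used for balancing, is to create synthetic minority samples by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified, dependency-free illustration of that idea, not the implementation used in the study; the function name, `k`, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority-class feature vectors.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy minority class with two standardized features (invented values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point is a blend of two real minority samples, the new points stay inside the region the minority class already occupies rather than being exact duplicates.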
3. Knowledge Transfer (UCC + Zenodo) #
Models trained jointly on both datasets showed reduced performance when tested on the UCC set, but slight gains when tested on Zenodo. This highlighted cross-population variability and the challenge of generalizing across cohorts due to differences in equipment, demographics, and data distributions.
Key Findings #
- Routine CBC parameters, combined with age and sex, carry enough information for ML-based COVID-19 detection.
- Balancing the data substantially improved classification metrics on the UCC dataset.
- On the UCC data, TabNet outperformed the tree-based models and the standard ANN, both before and after balancing.
- Cross-country generalization remains difficult; models require multi-site training and external validation for clinical deployment.
Conclusion #
The study demonstrates that routine CBC results, when processed with appropriate ML models, can serve as a fast, low-cost screening method for SARS-CoV-2. Deep learning models like TabNet can outperform traditional tree-based methods, though their deployment requires validation across diverse patient populations and clinical settings.