Machine-Aided Detection of SARS-CoV-2 from Complete Blood Count

The study investigated whether routine complete blood count (CBC) tests can be used to detect SARS-CoV-2 infection with machine learning (ML). The motivation was to create a cost-effective, scalable screening tool that could complement RT-PCR testing or preselect patients for it.

Data Sources

Two datasets were used:

UCC Dataset (Poland) – 22,463 patient records (2019–2020) from the University Clinical Centre in Gdańsk, with CBC and RT-PCR test results.
Zenodo Dataset (Italy) – 1,624 patients (2020–2021) from San Raffaele Hospital, provided by Cabitza et al. Only the CBC subset was used.

COVID-19 status was determined by RT-PCR results. Ambiguous or borderline cases were excluded.

Preprocessing

Only CBC features common to both datasets were selected.
Highly correlated feature pairs (correlation > 0.5) were pruned to avoid redundancy.
Missing continuous values were imputed using k-nearest neighbors (k-NN).
Features were standardized to zero mean and unit variance.
Demographic variables (age, sex) were included as additional inputs.
SMOTE (Synthetic Minority Oversampling Technique) was used to address class imbalance for some models.
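Two of the steps above, correlation pruning and standardization, can be sketched in plain Python. This is a minimal illustration, not the study's code: the helper names (`pearson`, `prune_correlated`, `standardize`), the use of the absolute correlation, the greedy keep-first rule, and the toy values (including the `HCT` column) are all assumptions made here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def prune_correlated(columns, threshold=0.5):
    """Greedily keep a column only if |r| <= threshold against every kept column.

    `columns` maps feature name -> list of values. Which member of a
    correlated pair survives depends on iteration order (an assumption
    of this sketch; the source does not say how pairs were resolved).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def standardize(values):
    """Rescale to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values] if std else [0.0] * n

# Toy values invented for illustration only; not taken from either dataset.
toy = {
    "HGB": [13.1, 14.2, 12.0, 15.3, 11.8],
    "HCT": [39.0, 42.5, 36.1, 45.8, 35.2],  # tracks HGB closely, so it is pruned
    "PLT": [210.0, 350.0, 180.0, 240.0, 400.0],
}
kept = prune_correlated(toy)
scaled = {name: standardize(toy[name]) for name in kept}
```

In a real pipeline the scaling statistics would normally be computed on the training split only and then reused for validation data.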

Selected Features

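Nine CBC parameters were used: hemoglobin (HGB), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), platelets (PLT), white blood cells (WBC), basophils (BA), lymphocytes (LY), monocytes (MO), and eosinophils (EO), plus age and sex.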

Models

Four machine learning models were trained and compared:

XGBoost – Gradient-boosted decision trees with tuned hyperparameters.
CatBoost – Gradient boosting with native support for categorical features, trained with cross-entropy loss.
Fully Connected ANN – 5 dense layers with dropout and sparse categorical cross-entropy loss.
TabNet – Attention-based deep neural network optimized for tabular data.

TabNet hyperparameters: Adam optimizer, learning rate 0.01, 1000 epochs (early stopping with patience 60), batch size 256/64, weighted categorical cross-entropy loss.
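The early-stopping rule mentioned above (up to 1000 epochs, patience 60) can be illustrated with a small framework-independent sketch. The class and function names are hypothetical, and monitoring the validation loss is an assumption; the source does not state which quantity was monitored.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=60, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.stale = 0  # epochs since the last improvement

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def run_training(val_losses, max_epochs=1000, patience=60):
    """Drive the stopper over precomputed losses (a stand-in for a real training loop)."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if stopper.step(epoch, loss):
            break
    return stopper.best_epoch, last_epoch


# Synthetic loss curve (invented): improves for 100 epochs, then plateaus.
losses = [1.0 - 0.005 * e for e in range(100)] + [0.6] * 900
best_epoch, stopped_at = run_training(losses)
```

With this synthetic curve the best epoch is 99 and training halts 60 epochs later, well before the 1000-epoch cap.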

Experimental Setup

Each dataset was split into 80% training and 20% validation.
Performance was evaluated using Accuracy, Precision, Recall, AUC, and F1-score.
Experiments tested:
1. Baseline training (UCC vs Zenodo separately)
2. Effect of balancing data
3. Knowledge transfer (joint training on both datasets)
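For reference, the threshold-based metrics listed above can all be derived from the four confusion-matrix counts. The sketch below uses hypothetical helper names and invented labels; AUC is left out because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall (sensitivity), and F1 for the positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```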

Results

1. Baseline (Unbalanced Data)

Model	Dataset	Summary	F1-score
TabNet	UCC	Overall best (accuracy 81.8%, sensitivity 82.2%)	15.9%*
CatBoost	Zenodo	Balanced performance	74.7%

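(*Note: the low F1-score on the unbalanced UCC data reflects class skew: positives are rare, so precision is low even when accuracy and sensitivity are high.)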
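To see how a high sensitivity can coexist with such a low F1-score, the F1 definition can be solved for precision. This is a back-of-the-envelope check, not a figure reported by the study, and it assumes that the reported sensitivity is the positive-class recall and that the F1-score refers to the same class and split.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2*P*R / (P + R) for P, giving P = F1*R / (2*R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision: F1 cannot be >= 2 * recall")
    return f1 * recall / denom


# Values from the baseline table: F1 = 15.9%, sensitivity = 82.2%.
p = implied_precision(0.159, 0.822)

# Round-trip check: plugging p back in recovers the F1-score.
f1_check = 2 * p * 0.822 / (p + 0.822)
```

Under these assumptions the implied precision is roughly 9%, meaning most positive predictions on the unbalanced data would be false alarms.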

2. Balanced Dataset (UCC)
#

Balancing improved results significantly:

Model	F1-score	AUC
XGBoost	66.3%	69.9%
CatBoost	70.5%	72.4%
ANN	72.0%	74.0%
TabNet	87.4%	87.1%

Balancing notably increased specificity and overall F1. TabNet achieved the highest metrics across all categories.
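The Preprocessing section names SMOTE as the balancing technique. Its core idea, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration with a hypothetical function name and invented points, not the study's implementation or a library API.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic points from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to `base`.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Invented 2-D minority points (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
```

Because each synthetic point lies between two real minority samples, oversampling is usually restricted to the training split so that validation data contains only real patients; whether the study did this is not stated.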

3. Knowledge Transfer (UCC + Zenodo)
#

Models trained jointly on both datasets showed reduced performance when tested on the UCC set, but slight gains when tested on Zenodo. This highlighted cross-population variability and the challenge of generalizing across cohorts due to differences in equipment, demographics, and data distributions.

Key Findings
#

CBC data, supplemented only by age and sex, can encode sufficient information for COVID-19 detection using ML.
Balancing data significantly improves classification metrics.
TabNet outperformed the tree-based models and the standard ANN on the UCC dataset, both before and after balancing, while CatBoost gave the most balanced performance on Zenodo.
Cross-country generalization remains difficult; models require multi-site training and external validation for clinical deployment.

Conclusion
#

The study demonstrates that routine CBC results, when processed with appropriate ML models, can serve as a fast, low-cost screening method for SARS-CoV-2. Deep learning models like TabNet can outperform traditional tree-based methods, though their deployment requires validation across diverse patient populations and clinical settings.
