
Machine-Aided Detection of SARS-CoV-2 from Complete Blood Count

The study investigated whether machine learning (ML) applied to routine complete blood count (CBC) tests can detect SARS-CoV-2 infection. The motivation was to create a cost-effective, scalable screening tool that could complement RT-PCR testing or preselect patients for it.

Data Sources

Two datasets were used:

  1. UCC Dataset (Poland) – 22,463 patient records (2019–2020) from the University Clinical Centre in Gdańsk, with CBC and RT-PCR test results.
  2. Zenodo Dataset (Italy) – 1,624 patients (2020–2021) from San Raffaele Hospital, provided by Cabitza et al. Only the CBC subset was used.

COVID-19 status was determined by RT-PCR results. Ambiguous or borderline cases were excluded.

Preprocessing

  • Only CBC features common to both datasets were selected.
  • For each highly correlated feature pair (correlation > 0.5), one feature was pruned to avoid redundancy.
  • Missing continuous values were imputed using k-nearest neighbors (k-NN).
  • Features were standardized to zero mean and unit variance.
  • Demographic variables (age, sex) were included as additional inputs.
  • SMOTE (Synthetic Minority Oversampling Technique) was used to address class imbalance for some models.
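
The pruning, imputation, and standardization steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the study's code: the order of the steps, the number of neighbours, the distance measure, the rule for which feature of a correlated pair to keep, and all data values are assumptions, and "RBC" is a hypothetical extra feature used only to show a pair being pruned.

```python
import math
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def prune_correlated(columns, threshold=0.5):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    Keeping the first feature of each correlated pair is an assumed rule.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def knn_impute(rows, k=2):
    """Replace None with the mean of that feature over the k nearest rows.

    Distance is the root-mean-square difference over the features that are
    observed in both rows (an assumed choice).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            nearest = sorted(candidates)[:k]
            if nearest:
                filled[i][j] = mean(v for _, v in nearest)
    return filled


def standardize(values):
    """Scale a column to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Toy values invented for illustration; they come from neither dataset.
columns = {
    "HGB": [12.0, 13.0, 14.0, 15.0, 16.0],
    "RBC": [4.0, 4.3, 4.7, 5.0, 5.4],      # hypothetical, tracks HGB closely
    "PLT": [250.0, 180.0, 300.0, 210.0, 260.0],
}
kept = prune_correlated(columns)            # RBC is dropped, HGB and PLT stay
rows = [[1.0, 10.0], [1.1, None], [5.0, 50.0], [1.2, 12.0]]
filled = knn_impute(rows, k=2)              # the gap is filled from rows 0 and 3
z_plt = standardize(columns["PLT"])
```

In practice a library implementation of each step would normally be preferred; the sketch only makes the logic of the three steps explicit.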

Selected Features

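Nine CBC parameters were used: hemoglobin (HGB), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), platelets (PLT), white blood cells (WBC), basophils (BA), lymphocytes (LY), monocytes (MO), and eosinophils (EO), plus age and sex.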

Models

Four machine learning models were trained and compared:

  1. XGBoost – Gradient-boosted decision trees with tuned hyperparameters.
  2. CatBoost – Categorical gradient boosting using cross-entropy loss.
  3. Fully Connected ANN – 5 dense layers with dropout and sparse categorical cross-entropy loss.
  4. TabNet – Attention-based deep neural network optimized for tabular data.

TabNet hyperparameters: Adam optimizer, learning rate 0.01, 1000 epochs (early stopping with patience 60), batch size 256/64, weighted categorical cross-entropy loss.
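
The early-stopping rule in these settings (up to 1000 epochs, patience 60) can be illustrated with a small helper. This is a sketch under assumptions: the summary does not say which quantity was monitored, so validation loss is assumed here, and the function name and return shape are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=60, max_epochs=1000):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best loss,
    or when `max_epochs` is reached, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: reset the wait
        elif epoch - best_epoch >= patience:
            break                                 # patience exhausted
    return best_epoch, last_epoch


# Toy loss curve with a short patience so the stop is easy to follow.
toy_losses = [1.0, 0.8, 0.9, 0.95, 0.97, 0.99]
stopped = early_stopping_epoch(toy_losses, patience=3)
```

With the toy curve, the best loss occurs at epoch 1 and training halts at epoch 4, three epochs later. The weights from the best epoch are typically the ones kept.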

Experimental Setup

  • Each dataset was split into 80% training and 20% validation.

  • Performance was evaluated using Accuracy, Precision, Recall, AUC, and F1-score.

  • Experiments tested:

    1. Baseline training (UCC and Zenodo, each trained separately)
    2. Effect of balancing data
    3. Knowledge transfer (joint training on both datasets)
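
The split and the evaluation metrics listed above can be sketched as follows. The summary states only an 80/20 split; stratifying by class, the random seed, and the toy labels are assumptions made for illustration.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, val_idx), splitting each class separately.

    Stratification is an assumption; the source only states 80% / 20%.
    """
    rng = random.Random(seed)
    train, val = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        val += idx[cut:]
    return sorted(train), sorted(val)


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1} (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the chance that a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels: 10 positives and 90 negatives, invented for illustration.
toy_labels = [1] * 10 + [0] * 90
train_idx, val_idx = stratified_split(toy_labels)
toy_metrics = binary_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but too slow for datasets of the size used in the study.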

Results

1. Baseline (Unbalanced Data)

| Model | Dataset | Performance summary | F1-score |
|---|---|---|---|
| TabNet | UCC | Overall best (accuracy 81.8%, sensitivity 82.2%) | 15.9%* |
| CatBoost | Zenodo | Balanced performance | 74.7% |

(*Note: low F1 in unbalanced UCC reflects class skew.)
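
To see how a high sensitivity and a low F1 can coexist, the F1 definition F1 = 2PR / (P + R) can be solved for precision: P = F1 · R / (2R − F1). The sketch below applies this to the reported TabNet figures. The resulting precision is a derived estimate, not a value reported in the source, and it assumes that the F1-score and the sensitivity refer to the same positive class and the same validation set.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Reported TabNet figures on the unbalanced UCC data.
precision = implied_precision(f1=0.159, recall=0.822)

# Round trip: recompute F1 from the implied precision and the recall.
f1_check = 2 * precision * 0.822 / (precision + 0.822)
```

Under these assumptions the implied precision is roughly 9%, meaning most positive predictions would be false alarms. That pattern is typical when positives are rare.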

2. Balanced Dataset (UCC)

Balancing improved results significantly:

| Model | F1-score | AUC |
|---|---|---|
| XGBoost | 66.3% | 69.9% |
| CatBoost | 70.5% | 72.4% |
| ANN | 72.0% | 74.0% |
| TabNet | 87.4% | 87.1% |

Balancing notably increased specificity and overall F1. TabNet achieved the highest metrics across all categories.
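
The balancing step relied on SMOTE, whose core idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that idea, not the implementation used in the study: the neighbour count, the seed, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority-class points (two standardized features), invented here.
toy_minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]]
toy_synthetic = smote(toy_minority, n_new=6, k=2)
```

A common precaution is to oversample only the training split, so that synthetic points derived from validation samples do not leak into training. The summary does not state how this was handled.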

3. Knowledge Transfer (UCC + Zenodo)

Models trained jointly on both datasets showed reduced performance when tested on the UCC set, but slight gains when tested on Zenodo. This highlighted cross-population variability and the challenge of generalizing across cohorts due to differences in equipment, demographics, and data distributions.

Key Findings

  • CBC data, supplemented with age and sex, can encode sufficient information for ML-based COVID-19 detection.
  • Balancing the training data substantially improves classification metrics.
  • On the UCC data, TabNet outperformed the tree-based models and the fully connected ANN.
  • Cross-country generalization remains difficult; models require multi-site training and external validation for clinical deployment.

Conclusion

The study demonstrates that routine CBC results, when processed with appropriate ML models, can serve as a fast, low-cost screening method for SARS-CoV-2. Deep learning models like TabNet can outperform traditional tree-based methods, though their deployment requires validation across diverse patient populations and clinical settings.