Information Extraction from Polish Radiology Reports using Language Models
Automatic parametrization of Polish radiology reports with deep language models, reaching an F1 score of 81% across 44 observation tags.
Radiology reports are critical for patient care but are typically written as free-text, which introduces ambiguity, inconsistency, and omission risks. Although structured reporting (SR) provides standardization, it is rarely adopted due to workflow burdens.
In our work, we propose an automatic parametrization model for Polish radiology reports using deep language models. Trained on 1,200 annotated chest CT reports labeled with 44 observation tags, our model achieves an F1 score of 81%.
The system bridges free-text flexibility with structured interpretability, facilitating clinical integration and data reuse.

1. Introduction #
Radiology reports guide clinical decisions but are often written in unstructured natural language. While this enables expressive flexibility, it hinders machine readability and consistent interpretation.
Structured Reporting (SR) — endorsed by RSNA and ESR — improves accuracy, consistency, and integration with classification systems (e.g., CO-RADS). However, SR adoption remains low due to perceived rigidity and workflow overhead.
To reconcile expressiveness with structure, we propose a language-model–based information extraction system that identifies radiological observations from Polish free-text reports and assigns them standardized tags.
Formally, this is a sequence labeling task under the information extraction paradigm, not simple NER, since radiological findings are contextual and span-dependent.
2. Related Work #
Structured Reporting #
Efforts toward SR include:
- Disease-specific templates such as BI-RADS and CO-RADS.
- DICOM-SR (DICOM Structured Reporting).
- RadLex ontology and AIM project for semantic interoperability.
- RSNA radreport.org templates.
Despite these, adoption is still limited.
Clinical IE and NER #
Prior work has applied BERT-based IE for Spanish and English radiology reports (e.g., Solarte-Pabón et al. 2021, Jain et al. 2021). Models like BioBERT, ClinicalBERT, and BlueBERT show strong performance on English clinical corpora.
For Polish, prior systems (Mykowiecka et al. 2009) were rule-based. No medical-domain Polish language models existed prior to this work.
Polish Language Models #
We evaluated the following general-domain models:
- Polish RoBERTa-base-v2
- Polish DistilRoBERTa
- Polish Longformer
- HerBERT
- mLUKE (multilingual LUKE, entity-aware pretraining)
3. Our Approach #
3.1 Dataset #
We collected 1,200 anonymized chest CT reports from the University Clinical Centre in Gdańsk, Poland. Two clinical experts annotated the data using the IOB (Inside-Outside-Beginning, also known as BIO) scheme across 44 radiological observation classes; an illustrative example of the tag format follows the table below.
| Entity (EN) | Train | Test |
|---|---|---|
| Pleural effusion | 722 | 184 |
| Pulmonary fibrosis | 631 | 165 |
| Bone lesions | 619 | 156 |
| Pulmonary consolidation | 543 | 143 |
| Ground-glass opacities | 482 | 141 |
| … | … | … |
The full list of classes is available in the paper.
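To make the annotation format concrete, here is a minimal, hypothetical illustration of the IOB scheme; the sentence and the class name are invented for the example and are not taken from the dataset.

```python
# Hypothetical IOB-annotated fragment (illustration only, not real data).
# English gloss: "Fluid present in the pleural cavity on the right side."
tokens = ["Obecny", "płyn", "w", "jamie", "opłucnej", "po", "stronie", "prawej", "."]
labels = ["O",
          "B-PleuralEffusion", "I-PleuralEffusion", "I-PleuralEffusion", "I-PleuralEffusion",
          "O", "O", "O", "O"]

# One label per word token: B- opens a span, I- continues it, O is outside any span.
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```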
3.2 Pre-processing #
- Anonymization: Names and IDs removed.
- Sentence splitting and tokenization via Stanza.
- Train/test split = 80/20.
- Two rare labels (<8 instances) were dropped.
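A minimal sketch of this preprocessing, assuming the Stanza Polish models are available; the random seed and the helper names are placeholders, not code from the paper.

```python
import random
import stanza

# Polish pipeline for sentence splitting and tokenization.
stanza.download("pl", processors="tokenize")
nlp = stanza.Pipeline(lang="pl", processors="tokenize")

def split_and_tokenize(report_text):
    """Split an already-anonymized report into sentences of word tokens."""
    doc = nlp(report_text)
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]

def train_test_split(reports, test_ratio=0.2, seed=42):
    """Shuffle reports and return an 80/20 train/test split."""
    random.seed(seed)
    shuffled = list(reports)
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```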
3.3 Model Architecture #
Each model encodes the tokenized input into contextual embeddings, which are passed through a fully connected layer that maps every token to a BIO-tagged entity class.
Only the first sub-token of each word is used in the loss computation; the remaining sub-tokens are masked with the -100 label.
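This masking is the standard token-classification recipe in Hugging Face Transformers; below is a sketch assuming a fast tokenizer (the HerBERT checkpoint is used only as an example, and the helper name is ours).

```python
from transformers import AutoTokenizer

# Example checkpoint only; any of the evaluated models could be substituted.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

def tokenize_and_align_labels(words, word_label_ids):
    """Label only the first sub-token of each word; mask the rest with -100."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous_word_id = [], None
    for word_id in encoding.word_ids():
        if word_id is None:                    # special tokens ([CLS], [SEP], ...)
            aligned.append(-100)
        elif word_id != previous_word_id:      # first sub-token of a word
            aligned.append(word_label_ids[word_id])
        else:                                  # remaining sub-tokens of the same word
            aligned.append(-100)
        previous_word_id = word_id
    encoding["labels"] = aligned
    return encoding
```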
For comparison, we also trained a Flair + BiLSTM + CRF baseline.
Loss function:
$$ \mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i) $$
Optimizer: Adam, \(\mathrm{lr} = 1 \times 10^{-5}\)
Warmup: 10% linear schedule
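A sketch of the corresponding optimization setup in PyTorch with Hugging Face Transformers; the checkpoint, the number of training steps, and the label count after dropping the two rare classes are assumptions for illustration.

```python
import torch
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

# One B- and one I- tag per observation class plus O
# (44 annotated classes, two rare ones dropped -> 42 modeled here).
NUM_LABELS = 2 * 42 + 1

model = AutoModelForTokenClassification.from_pretrained(
    "allegro/herbert-base-cased", num_labels=NUM_LABELS  # example checkpoint
)

# Hyperparameters from the text: Adam, lr = 1e-5, 10% linear warmup.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
num_training_steps = 1000  # placeholder: epochs * batches per epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)

# In the training loop the model's built-in cross-entropy ignores -100 labels:
#   outputs = model(**batch)   # batch holds input_ids, attention_mask, labels
#   outputs.loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```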
4. Experiments and Results #
4.1 Model Comparison #
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| HerBERT | 0.718 | 0.798 | 0.745 |
| Flair | 0.749 | 0.759 | 0.751 |
| DistilRoBERTa | 0.752 | 0.807 | 0.768 |
| Longformer | 0.767 | 0.809 | 0.778 |
| RoBERTa | 0.768 | 0.811 | 0.780 |
| mLUKE | 0.791 | 0.826 | 0.809 |
mLUKE achieved the best performance (F1 = 0.81), likely due to its entity-aware attention pretraining using Wikipedia entity links.
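Entity-level precision, recall, and F1 of this kind are typically computed with the seqeval library; the snippet below is a sketch of that evaluation (the tag names are toy examples, and the paper does not state which tooling was used).

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tag sequences, one list per sentence (toy example).
y_true = [["O", "B-PleuralEffusion", "I-PleuralEffusion", "O", "B-Cardiomegaly"]]
y_pred = [["O", "B-PleuralEffusion", "I-PleuralEffusion", "O", "O"]]

print(f1_score(y_true, y_pred))                         # micro-averaged entity-level F1
print(classification_report(y_true, y_pred, digits=3))  # per-class plus macro/micro rows
```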
4.2 Per-Class Performance (mLUKE) #
| Class (EN) | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Ground-glass opacities | 0.96 | 0.97 | 0.96 | 141 |
| Pulmonary consolidations | 0.84 | 0.87 | 0.86 | 62 |
| Pleural effusion | 0.82 | 0.85 | 0.83 | 187 |
| Cardiomegaly | 0.88 | 0.89 | 0.88 | 47 |
| Aortic dissection | 1.00 | 1.00 | 1.00 | 1 |
| Pulmonary embolism | 1.00 | 1.00 | 1.00 | 1 |
| Average (macro) | 0.73 | 0.78 | 0.75 | — |
| Average (micro) | 0.79 | 0.83 | 0.81 | 1981 |
5. Discussion #
Our system demonstrates that general-domain Polish language models can achieve strong performance on medical IE tasks — despite no domain-specific pretraining.
Observations:
- Accuracy correlates with class frequency.
- Low-frequency entities (e.g., “aortic dissection”) can still achieve high F1 if semantically unambiguous.
- The model’s main challenge is accurate span detection, not classification confusion.
This method can support automatic structuring of legacy reports, enabling data mining, cohort identification, and integration with clinical databases.
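As a sketch of how the extracted spans could be turned into a structured record for such downstream use (the output schema here is a hypothetical example, not a format defined in the paper):

```python
def spans_to_record(tokens, tags):
    """Group predicted BIO spans into {observation_class: [mention text, ...]}."""
    record = {}
    span_class, span_tokens = None, []

    def flush():
        if span_class is not None:
            record.setdefault(span_class, []).append(" ".join(span_tokens))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                              # new span starts
            flush()
            span_class, span_tokens = tag[2:], [token]
        elif tag.startswith("I-") and span_class == tag[2:]:  # span continues
            span_tokens.append(token)
        else:                                                 # outside or broken span
            flush()
            span_class, span_tokens = None, []
    flush()
    return record

# Toy usage with the hypothetical example from Section 3.1:
print(spans_to_record(
    ["Obecny", "płyn", "w", "jamie", "opłucnej"],
    ["O", "B-PleuralEffusion", "I-PleuralEffusion", "I-PleuralEffusion", "I-PleuralEffusion"],
))
# -> {'PleuralEffusion': ['płyn w jamie opłucnej']}
```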
6. Future Work #
We plan to:
- Pretrain domain-specific Polish medical language models.
- Expand the dataset with unlabeled corpora for masked-language modeling.
- Explore relation extraction between entities (e.g., findings ↔ anatomical locations).
7. Conclusion #
We introduced the first information extraction model for Polish radiology reports, achieving an F1 of 81% using multilingual and Polish transformer architectures.
This approach bridges free-text expressiveness with structured data interoperability, opening the way toward semi-automated structured reporting in Polish clinical practice.
Citation #
Obuchowski, A., Klaudel, B., & Jasik, P. (2023). Information Extraction from Polish Radiology Reports using Language Models. In Proceedings of the 9th Workshop on Slavic NLP (BSNLP 2023), pp. 113–122. Association for Computational Linguistics. https://aclanthology.org/2023.bsnlp-1.14
@inproceedings{obuchowski2023information,
  title     = {Information Extraction from Polish Radiology Reports using Language Models},
  author    = {Obuchowski, Aleksander and Klaudel, Barbara and Jasik, Patryk},
  booktitle = {Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)},
  pages     = {113--122},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.bsnlp-1.14}
}