
A critical look at studies applying over-sampling on the TPEHGDB dataset

Gilles Vandewiele (UGent) , Isabelle Dehaene (UGent) , Olivier Janssens (UGent) , Femke Ongenae (UGent) , Femke De Backere (UGent) , Filip De Turck (UGent) , Kristien Roelens (UGent) , Sofie Van Hoecke (UGent) and Thomas Demeester (UGent)
Abstract
Preterm birth is the leading cause of death among young children and has a large prevalence globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied to the entire dataset prior to partitioning it into training and test sets. This results in (i) samples that are highly correlated to data points from the test set being introduced and added to the training set, and (ii) artificial samples that are highly correlated to points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling to the training set.
Keywords
Preterm birth, Electrohysterogram (EHG), Imbalanced data, Over-sampling, PRETERM, TERM
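
The ordering problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes plain random over-sampling by duplication (the reviewed studies may use interpolating methods such as SMOTE instead), the class counts are hypothetical, and the names `oversample`, `split` and `leaked` are made up for this example. Each row carries only a source id and a label, so the sketch counts test rows whose source recording also appears in the training set.

```python
import random
from collections import Counter


def oversample(rows, rng):
    """Random over-sampling: duplicate rows of smaller classes until every class matches the largest."""
    counts = Counter(label for _, label in rows)
    target = max(counts.values())
    out = list(rows)
    for label, n in counts.items():
        pool = [row for row in rows if row[1] == label]
        out.extend(rng.choice(pool) for _ in range(target - n))
    return out


def split(rows, test_fraction, rng):
    """Shuffle the rows and partition them into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked(train, test):
    """Count test rows whose source recording also occurs in the training set."""
    train_ids = {source_id for source_id, _ in train}
    return sum(1 for source_id, _ in test if source_id in train_ids)


# Hypothetical class counts, chosen only for illustration: rows are (source_id, label).
rows = [(i, "preterm" if i < 40 else "term") for i in range(300)]

# Flawed order: over-sample the whole dataset, then partition.
rng = random.Random(0)
train_bad, test_bad = split(oversample(rows, rng), 0.25, rng)

# Sound order: partition first, then over-sample the training set only.
rng = random.Random(0)
train_raw, test_ok = split(rows, 0.25, rng)
train_ok = oversample(train_raw, rng)

print("leaked test rows, over-sampling before the split:", leaked(train_bad, test_bad))
print("leaked test rows, over-sampling after the split: ", leaked(train_ok, test_ok))
```

In the sound order the leak count is zero by construction, because the test set contains only original rows that were never offered to the over-sampler. With an interpolating over-sampler the leaked rows would be near-copies rather than exact duplicates, but the same ordering argument applies.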

Downloads

  • 7535 i.pdf: full text | open access | PDF | 272.99 KB
  • (...).pdf: full text | UGent only | PDF | 788.46 KB

Citation

MLA
Vandewiele, Gilles, et al. “A Critical Look at Studies Applying Over-Sampling on the TPEHGDB Dataset.” ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019, edited by David Riaño et al., vol. 11526, Springer, 2019, pp. 355–64.
APA
Vandewiele, G., Dehaene, I., Janssens, O., Ongenae, F., De Backere, F., De Turck, F., … Demeester, T. (2019). A critical look at studies applying over-sampling on the TPEHGDB dataset. In D. Riaño, S. Wilk, & A. ten Teije (Eds.), ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019 (Vol. 11526, pp. 355–364). Poznan, Poland: Springer.
Chicago author-date
Vandewiele, Gilles, Isabelle Dehaene, Olivier Janssens, Femke Ongenae, Femke De Backere, Filip De Turck, Kristien Roelens, Sofie Van Hoecke, and Thomas Demeester. 2019. “A Critical Look at Studies Applying Over-Sampling on the TPEHGDB Dataset.” In ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019, edited by David Riaño, Szymon Wilk, and Annette ten Teije, 11526:355–64. Springer.
Chicago author-date (all authors)
Vandewiele, Gilles, Isabelle Dehaene, Olivier Janssens, Femke Ongenae, Femke De Backere, Filip De Turck, Kristien Roelens, Sofie Van Hoecke, and Thomas Demeester. 2019. “A Critical Look at Studies Applying Over-Sampling on the TPEHGDB Dataset.” In ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019, edited by David Riaño, Szymon Wilk, and Annette ten Teije, 11526:355–364. Springer.
Vancouver
1. Vandewiele G, Dehaene I, Janssens O, Ongenae F, De Backere F, De Turck F, et al. A critical look at studies applying over-sampling on the TPEHGDB dataset. In: Riaño D, Wilk S, ten Teije A, editors. ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019. Springer; 2019. p. 355–64.
IEEE
[1] G. Vandewiele et al., “A critical look at studies applying over-sampling on the TPEHGDB dataset,” in ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019, Poznan, Poland, 2019, vol. 11526, pp. 355–364.
BibTeX
@inproceedings{8628812,
  abstract     = {Preterm birth is the leading cause of death among young children and has a large prevalence globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied to the entire dataset prior to partitioning it into training and test sets. This results in (i) samples that are highly correlated to data points from the test set being introduced and added to the training set, and (ii) artificial samples that are highly correlated to points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling to the training set.},
  author       = {Vandewiele, Gilles and Dehaene, Isabelle and Janssens, Olivier and Ongenae, Femke and De Backere, Femke and De Turck, Filip and Roelens, Kristien and Van Hoecke, Sofie and Demeester, Thomas},
  booktitle    = {ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 2019},
  editor       = {Riaño, David and Wilk, Szymon and ten Teije, Annette},
  isbn         = {9783030216412},
  issn         = {0302-9743},
  keywords     = {Preterm birth,Electrohysterogram (EHG),Imbalanced data,Over-sampling,PRETERM,TERM},
  language     = {eng},
  location     = {Poznan, Poland},
  pages        = {355--364},
  publisher    = {Springer},
  title        = {A critical look at studies applying over-sampling on the TPEHGDB dataset},
  url          = {http://dx.doi.org/10.1007/978-3-030-21642-9_45},
  volume       = {11526},
  year         = {2019},
}
