Hellinger distance decision trees for PU learning in imbalanced data sets

(2024) MACHINE LEARNING. 113(7). pp. 4547–4578
Author
Carlos Ortega Vázquez, Seppe vanden Broucke, and Jochen De Weerdt
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only train from positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer minority class examples than in the case where all labels are known, but also only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node in the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well in highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
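The abstract compresses the method into a few sentences, so a brief illustration may help. The sketch below shows, in Python, (i) one plausible way to estimate per-node positive and negative counts from the class prior under a SCAR-style assumption, (ii) the Hellinger distance as a split score, and (iii) the stratified bootstrap idea behind PU-SHRF. The function names and the exact count estimator are illustrative assumptions, not the paper's definitive formulation.

```python
# Minimal sketch of the ideas in the abstract. Helper names are hypothetical,
# and the SCAR-style count estimator is an assumption, not the paper's
# exact formulation.
import numpy as np

def estimate_node_counts(s_node, alpha):
    """Estimate (n_pos, n_neg) in a tree node from PU labels.

    s_node: 0/1 array, 1 = labeled positive, 0 = unlabeled.
    alpha:  assumed fraction of positives among unlabeled points, derived
            globally from the class prior pi as (pi*N - n_labeled)/n_unlabeled.
    """
    n_labeled = s_node.sum()
    n_unlabeled = len(s_node) - n_labeled
    n_pos = n_labeled + alpha * n_unlabeled   # labeled + expected hidden positives
    n_neg = (1.0 - alpha) * n_unlabeled       # remaining unlabeled mass
    return n_pos, n_neg

def hellinger_split_score(s_left, s_right, alpha):
    """Hellinger distance between the class-conditional branch distributions.

    The score depends only on the proportion of each (estimated) class routed
    left vs. right, not on the class ratio itself, which is why the criterion
    is reported to be insensitive to class imbalance.
    """
    pos_l, neg_l = estimate_node_counts(s_left, alpha)
    pos_r, neg_r = estimate_node_counts(s_right, alpha)
    n_pos, n_neg = pos_l + pos_r, neg_l + neg_r
    if n_pos == 0 or n_neg == 0:
        return 0.0
    return np.sqrt((np.sqrt(pos_l / n_pos) - np.sqrt(neg_l / n_neg)) ** 2
                   + (np.sqrt(pos_r / n_pos) - np.sqrt(neg_r / n_neg)) ** 2)

def stratified_bootstrap(X, s, rng):
    """Resample labeled positives and unlabeled points separately so every
    bootstrap sample keeps the original label frequency (the idea behind the
    stratified sampling in PU-SHRF)."""
    idx_lab = np.flatnonzero(s == 1)
    idx_unl = np.flatnonzero(s == 0)
    take = np.concatenate([rng.choice(idx_lab, size=len(idx_lab), replace=True),
                           rng.choice(idx_unl, size=len(idx_unl), replace=True)])
    return X[take], s[take]

# Toy usage: score one candidate threshold on a synthetic imbalanced PU set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = (X[:, 0] > 1.0).astype(int)              # ~16% true positives
s = y * (rng.random(1000) < 0.3)             # only ~30% of positives labeled
pi = y.mean()                                # class prior, assumed known here
alpha = (pi * len(s) - s.sum()) / (len(s) - s.sum())
mask = X[:, 0] <= 0.5                        # candidate split: feature <= 0.5
print(hellinger_split_score(s[mask], s[~mask], alpha))
```

Resampling the labeled positives and the unlabeled pool separately prevents bootstrap samples that contain almost no labeled positives, which would otherwise be common at high imbalance.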
Keywords
Artificial Intelligence, Software, PU Learning, Weakly supervised learning, Imbalanced classification, Ensemble learning, SUPERVISED AUC OPTIMIZATION, CLASSIFICATION, ALGORITHMS, SMOTE, SVM

Downloads

  • (...).pdf: full text (Published version) | UGent only | PDF | 1.95 MB

Citation

Please use this URL to cite or link to this publication:

MLA
Ortega Vázquez, Carlos, et al. “Hellinger Distance Decision Trees for PU Learning in Imbalanced Data Sets.” MACHINE LEARNING, vol. 113, no. 7, 2024, pp. 4547–78, doi:10.1007/s10994-023-06323-y.
APA
Ortega Vázquez, C., vanden Broucke, S., & De Weerdt, J. (2024). Hellinger distance decision trees for PU learning in imbalanced data sets. MACHINE LEARNING, 113(7), 4547–4578. https://doi.org/10.1007/s10994-023-06323-y
Chicago author-date
Ortega Vázquez, Carlos, Seppe vanden Broucke, and Jochen De Weerdt. 2024. “Hellinger Distance Decision Trees for PU Learning in Imbalanced Data Sets.” MACHINE LEARNING 113 (7): 4547–78. https://doi.org/10.1007/s10994-023-06323-y.
Chicago author-date (all authors)
Ortega Vázquez, Carlos, Seppe vanden Broucke, and Jochen De Weerdt. 2024. “Hellinger Distance Decision Trees for PU Learning in Imbalanced Data Sets.” MACHINE LEARNING 113 (7): 4547–4578. doi:10.1007/s10994-023-06323-y.
Vancouver
1. Ortega Vázquez C, vanden Broucke S, De Weerdt J. Hellinger distance decision trees for PU learning in imbalanced data sets. MACHINE LEARNING. 2024;113(7):4547–78.
IEEE
[1] C. Ortega Vázquez, S. vanden Broucke, and J. De Weerdt, “Hellinger distance decision trees for PU learning in imbalanced data sets,” MACHINE LEARNING, vol. 113, no. 7, pp. 4547–4578, 2024.
@article{01HJB0HTCVB3138QQ5CZMN0S5J,
  abstract     = {{Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only train from positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer minority class examples than in the case where all labels are known, but also only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node in the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well in highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.}},
  author       = {{Ortega Vázquez, Carlos and vanden Broucke, Seppe and De Weerdt, Jochen}},
  issn         = {{0885-6125}},
  journal      = {{MACHINE LEARNING}},
  keywords     = {{Artificial Intelligence,Software,PU Learning,Weakly supervised learning,Imbalanced classification,Ensemble learning,SUPERVISED AUC OPTIMIZATION,CLASSIFICATION,ALGORITHMS,SMOTE,SVM}},
  language     = {{eng}},
  number       = {{7}},
  pages        = {{4547--4578}},
  title        = {{Hellinger distance decision trees for PU learning in imbalanced data sets}},
  url          = {{https://doi.org/10.1007/s10994-023-06323-y}},
  volume       = {{113}},
  year         = {{2024}},
}
