
A self-training approach for short text clustering

Amir Hadifar (UGent), Lucas Sterckx (UGent), Thomas Demeester (UGent) and Chris Develder (UGent)
Abstract
Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations for short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
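
The sketch below illustrates the kind of self-training loop the abstract describes: an autoencoder's encoder embeds short texts, k-means provides initial cluster centroids, and the resulting soft assignments are sharpened into a target distribution that supervises further encoder updates. This is a minimal illustration in PyTorch and scikit-learn, not the authors' released code; the class and function names (Autoencoder, soft_assign, target_distribution, self_train), the layer sizes, and all hyperparameters are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class Autoencoder(nn.Module):
    def __init__(self, in_dim, latent_dim=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                                     nn.Linear(500, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                                     nn.Linear(500, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def soft_assign(z, centroids, alpha=1.0):
    # Student's t kernel between embeddings and centroids gives soft assignments q.
    dist = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpen q into an auxiliary target p that emphasizes confident assignments.
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def self_train(sentence_embeddings, n_clusters, epochs=100):
    x = torch.tensor(sentence_embeddings, dtype=torch.float32)
    model = Autoencoder(x.size(1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 1: pretrain the autoencoder with a reconstruction loss.
    for _ in range(epochs):
        z, recon = model(x)
        loss = F.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: initialize centroids with k-means on the encoded texts.
    with torch.no_grad():
        z, _ = model(x)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z.numpy())
    centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32,
                             requires_grad=True)
    opt = torch.optim.Adam(list(model.encoder.parameters()) + [centroids], lr=1e-4)

    # Stage 3: self-training; cluster assignments supervise the encoder via KL divergence.
    for _ in range(epochs):
        z, _ = model(x)
        q = soft_assign(z, centroids)
        p = target_distribution(q).detach()
        loss = F.kl_div(q.log(), p, reduction='batchmean')
        opt.zero_grad(); loss.backward(); opt.step()

    return q.argmax(dim=1)  # final cluster assignment per short text

In this reading, sentence_embeddings would be the pretrained sentence representations mentioned in the abstract; the clustering assignments act as the (pseudo-)supervision that keeps refining the encoder.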

Downloads

  • (...).pdf: full text (Published version) | UGent only | PDF | 4.81 MB
  • (...).pdf: full text (Published version) | UGent only | PDF | 4.87 MB

Citation

Please use this url to cite or link to this publication:

MLA
Hadifar, Amir, et al. “A Self-Training Approach for Short Text Clustering.” 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), Association for Computational Linguistics (ACL), 2019, pp. 194–99.
APA
Hadifar, A., Sterckx, L., Demeester, T., & Develder, C. (2019). A self-training approach for short text clustering. 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 194–199. Association for Computational Linguistics (ACL).
Chicago author-date
Hadifar, Amir, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. “A Self-Training Approach for Short Text Clustering.” In 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 194–99. Association for Computational Linguistics (ACL).
Chicago author-date (all authors)
Hadifar, Amir, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. “A Self-Training Approach for Short Text Clustering.” In 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 194–199. Association for Computational Linguistics (ACL).
Vancouver
1. Hadifar A, Sterckx L, Demeester T, Develder C. A self-training approach for short text clustering. In: 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019). Association for Computational Linguistics (ACL); 2019. p. 194–9.
IEEE
[1] A. Hadifar, L. Sterckx, T. Demeester, and C. Develder, “A self-training approach for short text clustering,” in 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), Florence, Italy, 2019, pp. 194–199.
@inproceedings{8621468,
  abstract     = {{Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations for short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.}},
  author       = {{Hadifar, Amir and Sterckx, Lucas and Demeester, Thomas and Develder, Chris}},
  booktitle    = {{4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019)}},
  isbn         = {{9781950737352}},
  language     = {{eng}},
  location     = {{Florence, Italy}},
  pages        = {{194--199}},
  publisher    = {{Association for Computational Linguistics (ACL)}},
  title        = {{A self-training approach for short text clustering}},
  year         = {{2019}},
}
