
Data augmentation and semi-supervised learning for deep neural networks-based text classifier
- Author
- Heereen Shim, Stijn Luca (UGent) , Dietwig Lowet and Bart Vanrumste
- Abstract
- User feedback is essential for understanding user needs. In this paper, we use free text obtained from a survey on sleep-related issues to build a deep neural network-based text classifier. However, training a deep neural network model requires a large amount of labelled data. To reduce manual data labelling, we propose a method that combines data augmentation and pseudo-labelling: data augmentation is applied to the labelled data to increase the size of the initial training set, and the trained model is then used to annotate unlabelled data with pseudo-labels. The results show that the model with data augmentation achieves a macro-averaged F1 score of 65.2% using 4,300 training examples, whereas the model without data augmentation achieves a macro-averaged F1 score of 68.2% with around 14,000 training examples. Furthermore, combined with pseudo-labelling, the model achieves a macro-averaged F1 score of 62.7% using only 1,400 labelled training examples. In other words, the proposed method reduces the amount of labelled data needed for training while achieving relatively good performance.
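The two-step procedure described in the abstract (augment a small labelled set, then pseudo-label an unlabelled pool with the resulting model) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the synthetic "embedding" vectors, the Gaussian-jitter `augment` helper, and the `NearestCentroid` stand-in classifier are all hypothetical simplifications of the deep neural network pipeline the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings of survey responses
# (hypothetical data; the paper works with real free-text answers).
X_pos = rng.normal(loc=+2.0, size=(20, 5))
X_neg = rng.normal(loc=-2.0, size=(20, 5))
X_labelled = np.vstack([X_pos[:5], X_neg[:5]])   # small labelled set
y_labelled = np.array([1] * 5 + [0] * 5)
X_unlabelled = np.vstack([X_pos[5:], X_neg[5:]])  # larger unlabelled pool


def augment(X, y, n_copies=2, noise=0.1):
    """Toy augmentation: add jittered copies of each labelled example."""
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        Xs.append(X + rng.normal(scale=noise, size=X.shape))
        ys.append(y)
    return np.vstack(Xs), np.concatenate(ys)


class NearestCentroid:
    """Trivial classifier standing in for the deep neural network."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array(
            [X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]


# Step 1: augment the labelled set and train an initial model.
X_aug, y_aug = augment(X_labelled, y_labelled)
model = NearestCentroid().fit(X_aug, y_aug)

# Step 2: pseudo-label the unlabelled pool, then retrain on the union
# of augmented labelled data and pseudo-labelled data.
pseudo = model.predict(X_unlabelled)
model = NearestCentroid().fit(
    np.vstack([X_aug, X_unlabelled]),
    np.concatenate([y_aug, pseudo]))
```

In practice the paper adds safeguards this sketch omits, such as selecting pseudo-labels by model confidence; the sketch only shows the basic augment-then-pseudo-label loop.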
- Keywords
- Text classification, data augmentation, semi-supervised learning, deep neural networks
Downloads
- SAC2020 - Data Augmentation and Semi-supervised Learning for DNN.pdf
- full text (Published version) | open access | 1.36 MB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8656299
- MLA
- Shim, Heereen, et al. “Data Augmentation and Semi-Supervised Learning for Deep Neural Networks-Based Text Classifier.” Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20), Assoc Computing Machinery, 2020, pp. 1119–26, doi:10.1145/3341105.3373992.
- APA
- Shim, H., Luca, S., Lowet, D., & Vanrumste, B. (2020). Data augmentation and semi-supervised learning for deep neural networks-based text classifier. Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20), 1119–1126. https://doi.org/10.1145/3341105.3373992
- Chicago author-date
- Shim, Heereen, Stijn Luca, Dietwig Lowet, and Bart Vanrumste. 2020. “Data Augmentation and Semi-Supervised Learning for Deep Neural Networks-Based Text Classifier.” In Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20), 1119–26. New York: Assoc Computing Machinery. https://doi.org/10.1145/3341105.3373992.
- Chicago author-date (all authors)
- Shim, Heereen, Stijn Luca, Dietwig Lowet, and Bart Vanrumste. 2020. “Data Augmentation and Semi-Supervised Learning for Deep Neural Networks-Based Text Classifier.” In Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20), 1119–1126. New York: Assoc Computing Machinery. doi:10.1145/3341105.3373992.
- Vancouver
- 1. Shim H, Luca S, Lowet D, Vanrumste B. Data augmentation and semi-supervised learning for deep neural networks-based text classifier. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20). New York: Assoc Computing Machinery; 2020. p. 1119–26.
- IEEE
- [1] H. Shim, S. Luca, D. Lowet, and B. Vanrumste, “Data augmentation and semi-supervised learning for deep neural networks-based text classifier,” in Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC’20), Czech Tech Univ, Electr Network, 2020, pp. 1119–1126.
@inproceedings{8656299,
  abstract     = {{User feedback is essential for understanding user needs. In this paper, we use free-text obtained from a survey on sleep-related issues to build a deep neural networks-based text classifier. However, to train the deep neural networks model, a lot of labelled data is needed. To reduce manual data labelling, we propose a method which is a combination of data augmentation and pseudo-labelling: data augmentation is applied to labelled data to increase the size of the initial train set and then the trained model is used to annotate unlabelled data with pseudo-labels. The result shows that the model with the data augmentation achieves macro-averaged f1 score of 65.2% while using 4,300 training data, whereas the model without data augmentation achieves macro-averaged f1 score of 68.2% with around 14,000 training data. Furthermore, with the combination of pseudo-labelling, the model achieves macro-averaged f1 score of 62.7% with only using 1,400 training data with labels. In other words, with the proposed method we can reduce the amount of labelled data for training while achieving relatively good performance.}},
  author       = {{Shim, Heereen and Luca, Stijn and Lowet, Dietwig and Vanrumste, Bart}},
  booktitle    = {{Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC'20)}},
  isbn         = {{9781450368667}},
  keywords     = {{Text classification,data augmentation,semi-supervised learning,deep neural networks}},
  language     = {{eng}},
  location     = {{Czech Tech Univ, Electr Network}},
  pages        = {{1119--1126}},
  publisher    = {{Assoc Computing Machinery}},
  title        = {{Data augmentation and semi-supervised learning for deep neural networks-based text classifier}},
  url          = {{https://doi.org/10.1145/3341105.3373992}},
  year         = {{2020}},
}