Abstract
Compounding is one of the most productive word formation processes in many languages and is therefore a main source of data sparsity in language modeling. Many solutions have been suggested to model compound words, most of which break the compound into its constituents and train a new model with them. In earlier work, we argued that this approach is suboptimal and we presented a novel technique that clusters new, domain-specific compound words together with their semantic heads. The clusters were then used to build a class-based n-gram model that enabled a reliable estimation of n-gram probabilities, without the need for additional training data. In this paper, we investigate how this "semantic head mapping" can best be made an integral part of the language modeling strategy and find that, with some adaptations, our technique is capable of producing more accurate compound probability estimates than a baseline word-based n-gram language model, which lead to a significant word error rate reduction for Dutch read speech.
Keywords
n-grams, data sparsity, LVCSR, language models, word clusters
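
Illustrative sketch (not taken from the paper): the abstract describes clustering unseen compounds with their semantic heads and estimating probabilities through a class-based n-gram model. The toy Python code below shows one way such a class-based bigram estimate could look, using the standard factorization P(word | prev) = P(class | prev) * P(word | class). The function names, the toy Dutch corpus, the compound-to-head mapping, and the uniform within-class distribution are all assumptions made for illustration; the paper's actual estimation and adaptations may differ.

```python
from collections import Counter, defaultdict


def count_bigrams(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def build_classes(head_of):
    """Group each head with the compounds mapped to it (hypothetical mapping)."""
    members = defaultdict(set)
    for compound, head in head_of.items():
        members[head].update({head, compound})
    return members


def class_bigram_prob(prev, word, head_of, members, context_counts, bigram_counts):
    """Estimate P(word | prev) as P(class | prev) * P(word | class)."""
    cls = head_of.get(word, word)          # a word without a head is its own class
    cls_members = members.get(cls, {cls})
    if context_counts[prev] == 0:
        return 0.0                          # unseen context: no estimate in this sketch
    class_count = sum(bigram_counts[(prev, m)] for m in cls_members)
    p_class = class_count / context_counts[prev]
    p_word = 1.0 / len(cls_members)         # assumption: uniform within-class distribution
    return p_class * p_word


# Toy training data in which the compound "voetbalwedstrijd" never occurs.
sentences = [["de", "wedstrijd"], ["de", "wedstrijd"], ["de", "man"], ["een", "wedstrijd"]]
head_of = {"voetbalwedstrijd": "wedstrijd"}   # compound -> semantic head (assumed)

context_counts, bigram_counts = count_bigrams(sentences)
members = build_classes(head_of)
p_compound = class_bigram_prob("de", "voetbalwedstrijd", head_of, members,
                               context_counts, bigram_counts)
```

In this sketch the unseen compound borrows probability mass from its head, so it receives a non-zero estimate without any additional training data, while the estimates for a given context still sum to one.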

Citation

Please use this url to cite or link to this publication:

MLA
Pelemans, Joris, et al. “Improving N-Gram Probability Estimates by Compound-Head Clustering.” International Conference on Acoustics Speech and Signal Processing ICASSP, IEEE, 2015, pp. 5221–25.
APA
Pelemans, J., Demuynck, K., Van Hamme, H., & Wambacq, P. (2015). Improving N-Gram probability estimates by compound-head clustering. In International Conference on Acoustics Speech and Signal Processing ICASSP (pp. 5221–5225). NEW YORK: IEEE.
Chicago author-date
Pelemans, Joris, Kris Demuynck, Hugo Van Hamme, and Patrick Wambacq. 2015. “Improving N-Gram Probability Estimates by Compound-Head Clustering.” In International Conference on Acoustics Speech and Signal Processing ICASSP, 5221–25. NEW YORK: IEEE.
Chicago author-date (all authors)
Pelemans, Joris, Kris Demuynck, Hugo Van Hamme, and Patrick Wambacq. 2015. “Improving N-Gram Probability Estimates by Compound-Head Clustering.” In International Conference on Acoustics Speech and Signal Processing ICASSP, 5221–5225. NEW YORK: IEEE.
Vancouver
1. Pelemans J, Demuynck K, Van Hamme H, Wambacq P. Improving N-Gram probability estimates by compound-head clustering. In: International Conference on Acoustics Speech and Signal Processing ICASSP. NEW YORK: IEEE; 2015. p. 5221–5.
IEEE
[1] J. Pelemans, K. Demuynck, H. Van Hamme, and P. Wambacq, “Improving N-Gram probability estimates by compound-head clustering,” in International Conference on Acoustics Speech and Signal Processing ICASSP, Brisbane, AUSTRALIA, 2015, pp. 5221–5225.
@inproceedings{8057949,
  abstract     = {Compounding is one of the most productive word formation processes in many languages and is therefore a main source of data sparsity in language modeling. Many solutions have been suggested to model compound words, most of which break the compound into its constituents and train a new model with them. In earlier work, we argued that this approach is suboptimal and we presented a novel technique that clusters new, domain-specific compound words together with their semantic heads. The clusters were then used to build a class-based n-gram model that enabled a reliable estimation of n-gram probabilities, without the need for additional training data. In this paper, we investigate how this "semantic head mapping" can best be made an integral part of the language modeling strategy and find that, with some adaptations, our technique is capable of producing more accurate compound probability estimates than a baseline word-based n-gram language model, which lead to a significant word error rate reduction for Dutch read speech.},
  author       = {Pelemans, Joris and Demuynck, Kris and Van Hamme, Hugo and Wambacq, Patrick},
  booktitle    = {International Conference on Acoustics Speech and Signal Processing ICASSP},
  isbn         = {978-1-4673-6997-8},
  issn         = {1520-6149},
  keywords     = {n-grams,data sparsity,LVCSR,language models,word clusters},
  language     = {eng},
  location     = {Brisbane, AUSTRALIA},
  pages        = {5221--5225},
  publisher    = {IEEE},
  title        = {Improving N-Gram probability estimates by compound-head clustering},
  year         = {2015},
}
