
Coping with language data sparsity: semantic head mapping for compound words

Abstract
In this paper we present a novel clustering technique for compound words. By mapping compounds onto their semantic heads, the technique is able to estimate n-gram probabilities for unseen compounds. We argue that compounds are well represented by their heads, which allows the clustering of rare words and reduces the risk of over-generalization. The semantic heads are obtained by a two-step process which consists of constituent generation and best head selection based on corpus statistics. Experiments on Dutch read speech show that our technique is capable of correctly identifying compounds and their semantic heads with a precision of 80.25% and a recall of 85.97%. A class-based language model with compound-head clusters achieves a significant reduction in both perplexity and WER.
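The two-step process named in the abstract (constituent generation, then head selection from corpus statistics) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details, so the sketch assumes that the head is the final constituent of the compound, that candidate heads are suffixes found in a known vocabulary, and that the most frequent candidate wins. The function names, the `min_len` threshold, and the toy counts are all hypothetical, and linking morphemes are ignored.

```python
from collections import Counter


def candidate_heads(word, vocabulary, min_len=3):
    """Step 1 (constituent generation, simplified): split `word` at every
    position and keep the splits whose right-hand part is a known word."""
    candidates = []
    for i in range(min_len, len(word) - min_len + 1):
        modifier, head = word[:i], word[i:]
        if head in vocabulary:
            candidates.append((modifier, head))
    return candidates


def best_head(word, unigram_counts, min_len=3):
    """Step 2 (head selection, simplified): choose the candidate head with
    the highest corpus count, breaking ties in favour of the longer head."""
    candidates = candidate_heads(word, unigram_counts, min_len)
    if not candidates:
        return None
    _, head = max(candidates, key=lambda mh: (unigram_counts[mh[1]], len(mh[1])))
    return head


def map_to_head(word, unigram_counts):
    """Map an unseen compound onto its head; known words and words without
    a usable split are returned unchanged."""
    if word in unigram_counts:
        return word
    head = best_head(word, unigram_counts)
    return head if head is not None else word


# Toy unigram counts (invented for illustration, not taken from the paper).
counts = Counter({"deur": 120, "keuken": 80, "appel": 70, "taart": 60})

print(map_to_head("keukendeur", counts))  # unseen compound, mapped to "deur"
print(map_to_head("appeltaart", counts))  # unseen compound, mapped to "taart"
print(map_to_head("deur", counts))        # known word, left unchanged
```

In a class-based language model, the mapped head would then supply the n-gram statistics for the unseen compound, which is the role the abstract assigns to compound-head clusters.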
Keywords
OOV, clustering, sparsity, n-grams, compounds

Downloads

  • (...).pdf: full text | UGent only | PDF | 88.88 KB

Citation


MLA
Pelemans, Joris, et al. “Coping with Language Data Sparsity: Semantic Head Mapping for Compound Words.” International Conference on Acoustics Speech and Signal Processing ICASSP, IEEE, 2014, pp. 141–45.
APA
Pelemans, J., Demuynck, K., Van hamme, H., & Wambacq, P. (2014). Coping with language data sparsity: semantic head mapping for compound words. International Conference on Acoustics Speech and Signal Processing ICASSP, 141–145. IEEE.
Chicago author-date
Pelemans, Joris, Kris Demuynck, Hugo Van hamme, and Patrick Wambacq. 2014. “Coping with Language Data Sparsity: Semantic Head Mapping for Compound Words.” In International Conference on Acoustics Speech and Signal Processing ICASSP, 141–45. IEEE.
Chicago author-date (all authors)
Pelemans, Joris, Kris Demuynck, Hugo Van hamme, and Patrick Wambacq. 2014. “Coping with Language Data Sparsity: Semantic Head Mapping for Compound Words.” In International Conference on Acoustics Speech and Signal Processing ICASSP, 141–145. IEEE.
Vancouver
1. Pelemans J, Demuynck K, Van hamme H, Wambacq P. Coping with language data sparsity: semantic head mapping for compound words. In: International Conference on Acoustics Speech and Signal Processing ICASSP. IEEE; 2014. p. 141–5.
IEEE
[1] J. Pelemans, K. Demuynck, H. Van hamme, and P. Wambacq, “Coping with language data sparsity: semantic head mapping for compound words,” in International Conference on Acoustics Speech and Signal Processing ICASSP, Florence, Italy, 2014, pp. 141–145.
BibTeX
@inproceedings{4404017,
  abstract     = {{In this paper we present a novel clustering technique for compound words. By mapping compounds onto their semantic heads, the technique is able to estimate n-gram probabilities for unseen compounds. We argue that compounds are well represented by their heads, which allows the clustering of rare words and reduces the risk of over-generalization. The semantic heads are obtained by a two-step process which consists of constituent generation and best head selection based on corpus statistics. Experiments on Dutch read speech show that our technique is capable of correctly identifying compounds and their semantic heads with a precision of 80.25% and a recall of 85.97%. A class-based language model with compound-head clusters achieves a significant reduction in both perplexity and WER.}},
  author       = {{Pelemans, Joris and Demuynck, Kris and Van hamme, Hugo and Wambacq, Patrick}},
  booktitle    = {{International Conference on Acoustics Speech and Signal Processing ICASSP}},
  isbn         = {{9781479928934}},
  issn         = {{1520-6149}},
  keywords     = {{OOV,clustering,sparsity,n-grams,compounds}},
  language     = {{eng}},
  location     = {{Florence, Italy}},
  pages        = {{141--145}},
  publisher    = {{IEEE}},
  title        = {{Coping with language data sparsity: semantic head mapping for compound words}},
  year         = {{2014}},
}
