All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch

Orphée De Clercq (UGent) and Veronique Hoste (UGent)
(2016) COMPUTATIONAL LINGUISTICS. 42(3). p.457-490
Project
LT3
Abstract
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics ranging from easy-to-compute superficial text characteristics to features requiring a deep linguistic processing, resulting in ten different feature groups. Both a regression and classification setup are investigated reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights in which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing we are able to investigate the true upper bound of our Dutch system. Interestingly, we will observe that the performance of our fully-automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information.
Keywords
LANGUAGE, COHESION, TEXTS, COH-METRIX, FORMULAS, DIFFICULTY

Downloads

  • coli_a_00255.pdf (full text | open access | PDF | 1.20 MB)

Citation

Chicago
De Clercq, Orphée, and Veronique Hoste. 2016. “All Mixed up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch.” Computational Linguistics 42 (3): 457–490.
APA
De Clercq, O., & Hoste, V. (2016). All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch. Computational Linguistics, 42(3), 457–490.
Vancouver
1. De Clercq O, Hoste V. All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch. Computational Linguistics. MIT Press; 2016;42(3):457–90.
MLA
De Clercq, Orphée, and Veronique Hoste. “All Mixed up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch.” Computational Linguistics 42.3 (2016): 457–490. Print.
@article{7175390,
  abstract     = {Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics ranging from easy-to-compute superficial text characteristics to features requiring a deep linguistic processing, resulting in ten different feature groups. Both a regression and classification setup are investigated reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights in which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing we are able to investigate the true upper bound of our Dutch system. Interestingly, we will observe that the performance of our fully-automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information.},
  author       = {De Clercq, Orph{\'e}e and Hoste, Veronique},
  issn         = {0891-2017},
  journal      = {COMPUTATIONAL LINGUISTICS},
  keyword      = {LANGUAGE,COHESION,TEXTS,COH-METRIX,FORMULAS,DIFFICULTY},
  language     = {eng},
  number       = {3},
  pages        = {457--490},
  publisher    = {MIT Press},
  title        = {All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch},
  url          = {http://dx.doi.org/10.1162/COLI\_a\_00255},
  volume       = {42},
  year         = {2016},
}
