Ghent University Academic Bibliography

ParaSense: parallel corpora for word sense disambiguation

Els Lefever UGent (2012)
abstract
This thesis presents a machine learning approach to Word Sense Disambiguation (WSD), the task of selecting the correct sense of an ambiguous word in a given context.
We recast the task of disambiguating polysemous nouns as a multilingual classification task. Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework where the word senses are derived automatically from word alignments on a parallel corpus. As a consequence, the task is turned into a cross-lingual WSD task that consists of selecting the contextually correct translation of an ambiguous target word.
In order to evaluate the viability of cross-lingual Word Sense Disambiguation, we constructed a lexical sample data set of twenty ambiguous nouns. For the creation of the multilingual sense inventory, we first applied word alignment to a six-lingual parallel corpus and manually clustered the obtained translations by meaning for all target words. The resulting multilingual sense inventory then served as the basis for the annotation of the test data.
The ParaSense WSD system we propose in this thesis presents a truly multilingual classification-based approach to WSD that directly incorporates evidence from four other languages. We built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both local context features and translation features that are extracted from the aligned translations.
The hypothesis underlying the construction of a multilingual WSD system is that adding translational evidence from multiple languages will be more informative than using only monolingual or bilingual information. We believe it is possible to use the differences between the languages to gain some leverage on word meanings and better disambiguate a polysemous word in a given context.
The experimental results confirm the validity of our approach: the classifiers that employ translational evidence consistently outperform the classifiers that only exploit local context information for four out of five target languages, viz. French, Spanish, German and Dutch. Furthermore, a comparison with all systems that participated in a dedicated cross-lingual Word Sense Disambiguation competition revealed that the ParaSense system outperforms all other systems for all five target languages.
As our system extracts all information from the parallel corpus at hand, it is a very flexible and language-independent approach that makes it possible to bypass the knowledge acquisition bottleneck for Word Sense Disambiguation.
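The feature design described in the abstract (local context features combined with translation features, with a translation in the target language as the class label) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the ParaSense implementation: the function names, the window size, the overlap-based nearest-neighbour classifier and the toy sentences are all invented here, the abstract does not name the learning algorithm, and the sketch simply assumes that translations of the test sentence are available.

```python
from collections import Counter


def local_context_features(tokens, target_index, window=3):
    """Position-tagged words within `window` tokens of the target word."""
    feats = set()
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            feats.add(("ctx", i - target_index, tokens[i].lower()))
    return feats


def translation_features(translations):
    """Bag-of-words features from the aligned translations, keyed by language."""
    feats = set()
    for lang, sentence in translations.items():
        for word in sentence.lower().split():
            feats.add(("bow", lang, word))
    return feats


def build_features(tokens, target_index, translations):
    """Union of local context features and translation features."""
    return local_context_features(tokens, target_index) | translation_features(translations)


def classify(features, training, k=1):
    """Majority label among the k training instances with the largest feature overlap.

    Assumption: a simple overlap-based nearest-neighbour vote stands in for
    whatever learner the thesis actually uses.
    """
    ranked = sorted(training, key=lambda ex: len(features & ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (invented): English target noun "bank", French as the output language,
# and the other four languages supplying translation features.
training = [
    (build_features("He deposited money at the bank".split(), 5, {
        "nl": "hij stortte geld bij de bank",
        "de": "er zahlte Geld bei der Bank ein",
        "es": "depositó dinero en el banco",
        "it": "ha depositato denaro in banca",
    }), "banque"),
    (build_features("They walked along the bank of the river".split(), 4, {
        "nl": "ze liepen langs de oever van de rivier",
        "de": "sie gingen am Ufer des Flusses entlang",
        "es": "caminaron por la orilla del río",
        "it": "camminavano lungo la riva del fiume",
    }), "rive"),
]

test_tokens = "She opened an account at the bank".split()
test_features = build_features(test_tokens, 6, {
    "nl": "ze opende een rekening bij de bank",
    "de": "sie eröffnete ein Konto bei der Bank",
    "es": "abrió una cuenta en el banco",
    "it": "ha aperto un conto in banca",
})

print(classify(test_features, training))
```

In this toy setting one such classifier would be built per output language, each drawing its translation features from the remaining four languages, which mirrors the five-classifier setup the abstract describes.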
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-2999961
author
promoter
UGent and UGent
organization
alternative title
Het gebruik van parallelle corpora voor het automatisch desambigueren van polyseme woorden
year
type
dissertation (monograph)
subject
keyword
Parallel corpora, natural language processing, Word Sense Disambiguation
pages
XI, 205 pages
publisher
Ghent University. Faculty of Sciences
place of publication
Ghent, Belgium
defense location
Gent : Het Pand (zaal rector Blancquaert)
defense date
2012-09-24 16:00
ISBN
9789070830908
language
English
UGent publication?
yes
classification
D1
copyright statement
I have retained and own the full copyright for this publication
id
2999961
handle
http://hdl.handle.net/1854/LU-2999961
date created
2012-09-27 10:05:59
date last changed
2014-02-05 11:12:46
@phdthesis{2999961,
  abstract     = {This thesis presents a machine learning approach to Word Sense Disambiguation (WSD), the task of selecting the correct sense of an ambiguous word in a given context.
We recast the task of disambiguating polysemous nouns as a multilingual classification task. Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework where the word senses are derived automatically from word alignments on a parallel corpus. As a consequence, the task is turned into a cross-lingual WSD task that consists of selecting the contextually correct translation of an ambiguous target word.
In order to evaluate the viability of cross-lingual Word Sense Disambiguation, we constructed a lexical sample data set of twenty ambiguous nouns. For the creation of the multilingual sense inventory, we first applied word alignment to a six-lingual parallel corpus and manually clustered the obtained translations by meaning for all target words. The resulting multilingual sense inventory then served as the basis for the annotation of the test data.
The ParaSense WSD system we propose in this thesis presents a truly multilingual classification-based approach to WSD that directly incorporates evidence from four other languages. We built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both local context features and translation features that are extracted from the aligned translations.
The hypothesis underlying the construction of a multilingual WSD system is that adding translational evidence from multiple languages will be more informative than using only monolingual or bilingual information. We believe it is possible to use the differences between the languages to gain some leverage on word meanings and better disambiguate a polysemous word in a given context.
The experimental results confirm the validity of our approach: the classifiers that employ translational evidence consistently outperform the classifiers that only exploit local context information for four out of five target languages, viz. French, Spanish, German and Dutch. Furthermore, a comparison with all systems that participated in a dedicated cross-lingual Word Sense Disambiguation competition revealed that the ParaSense system outperforms all other systems for all five target languages.
As our system extracts all information from the parallel corpus at hand, it is a very flexible and language-independent approach that makes it possible to bypass the knowledge acquisition bottleneck for Word Sense Disambiguation.},
  author       = {Lefever, Els},
  isbn         = {9789070830908},
  keyword      = {Parallel corpora,natural language processing,Word Sense Disambiguation},
  language     = {eng},
  pages        = {XI, 205},
  publisher    = {Ghent University. Faculty of Sciences},
  school       = {Ghent University},
  title        = {ParaSense: parallel corpora for word sense disambiguation},
  year         = {2012},
}

Chicago
Lefever, Els. 2012. “ParaSense: Parallel Corpora for Word Sense Disambiguation”. Ghent, Belgium: Ghent University. Faculty of Sciences.
APA
Lefever, E. (2012). ParaSense: parallel corpora for word sense disambiguation. Ghent University. Faculty of Sciences, Ghent, Belgium.
Vancouver
1. Lefever E. ParaSense: parallel corpora for word sense disambiguation. [Ghent, Belgium]: Ghent University. Faculty of Sciences; 2012.
MLA
Lefever, Els. “ParaSense: Parallel Corpora for Word Sense Disambiguation.” 2012 : n. pag. Print.