
Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings

Els Lefever (UGent), Sofie Labat (UGent) and Pranaydeep Singh (UGent)
Abstract
This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.
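The feature setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not list the fifteen orthographic features, so the two string measures below (normalized Levenshtein similarity and common-prefix ratio) are assumed stand-ins, and the function names and toy vectors are hypothetical. The sketch only shows how orthographic scores and an embedding cosine similarity could be combined into one input vector for a classifier such as a multi-layer perceptron.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def common_prefix_ratio(a: str, b: str) -> float:
    """Length of the shared prefix divided by the longer word's length."""
    shared = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        shared += 1
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else shared / longest


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


def feature_vector(source: str, target: str,
                   source_vec: list[float], target_vec: list[float]) -> list[float]:
    """Orthographic scores followed by the semantic (embedding) score."""
    return [
        normalized_levenshtein_similarity(source, target),
        common_prefix_ratio(source, target),
        cosine_similarity(source_vec, target_vec),
    ]


# Toy example: the embedding values are invented for illustration only.
features = feature_vector("night", "nacht", [0.2, 0.9, 0.1], [0.25, 0.8, 0.05])
```

In a real setup the two word vectors would come from embeddings that have already been aligned in a shared cross-lingual space, since the cosine similarity between unaligned monolingual vectors is not meaningful.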
Keywords
LT3, cognate detection, multi-layer perceptron, orthographic similarity, cross-lingual word embeddings

Downloads

  • LREC2020 Cognates.pdf
    • full text (Published version) | open access | PDF | 334.68 KB

Citation

MLA
Lefever, Els, et al. “Identifying Cognates in English-Dutch and French-Dutch by Means of Orthographic Information and Cross-Lingual Word Embeddings.” PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), European Language Resources Association (ELRA), 2020, pp. 4096–101.
APA
Lefever, E., Labat, S., & Singh, P. (2020). Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 4096–4101. European Language Resources Association (ELRA).
Chicago author-date
Lefever, Els, Sofie Labat, and Pranaydeep Singh. 2020. “Identifying Cognates in English-Dutch and French-Dutch by Means of Orthographic Information and Cross-Lingual Word Embeddings.” In PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 4096–4101. European Language Resources Association (ELRA).
Vancouver
1. Lefever E, Labat S, Singh P. Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings. In: PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020). European Language Resources Association (ELRA); 2020. p. 4096–101.
IEEE
[1] E. Lefever, S. Labat, and P. Singh, “Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings,” in PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), Marseille, France, 2020, pp. 4096–4101.
@inproceedings{8662200,
  abstract     = {{This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.}},
  author       = {{Lefever, Els and Labat, Sofie and Singh, Pranaydeep}},
  booktitle    = {{PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020)}},
  isbn         = {{9791095546344}},
  issn         = {{2522-2686}},
  keywords     = {{LT3,cognate detection,multi-layer perceptron,orthographic similarity,cross-lingual word embeddings}},
  language     = {{eng}},
  location     = {{Marseille, France}},
  pages        = {{4096--4101}},
  publisher    = {{European Language Resources Association (ELRA)}},
  title        = {{Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings}},
  url          = {{http://www.lrec-conf.org/proceedings/lrec2020/LREC-2020.pdf}},
  year         = {{2020}},
}
