Cell line name recognition in support of the identification of synthetic lethality in cancer from text

(2016) BIOINFORMATICS. 32(2). p.276-282
Project
Bioinformatics: from nucleotids to networks (N2N)
Abstract
Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
Keywords
SHARED TASK

Downloads

  • Kaewphan et al. 2016 Bioinformatics 32 276.pdf (full text, open access, PDF, 187.99 KB)

Citation

Chicago
Kaewphan, Suwisa, Sofie Van Landeghem, Tomoko Ohta, Yves Van de Peer, Filip Ginter, and Sampo Pyysalo. 2016. “Cell Line Name Recognition in Support of the Identification of Synthetic Lethality in Cancer from Text.” Bioinformatics 32 (2): 276–282.
APA
Kaewphan, S., Van Landeghem, S., Ohta, T., Van de Peer, Y., Ginter, F., & Pyysalo, S. (2016). Cell line name recognition in support of the identification of synthetic lethality in cancer from text. Bioinformatics, 32(2), 276–282.
Vancouver
1. Kaewphan S, Van Landeghem S, Ohta T, Van de Peer Y, Ginter F, Pyysalo S. Cell line name recognition in support of the identification of synthetic lethality in cancer from text. Bioinformatics. 2016;32(2):276–82.
MLA
Kaewphan, Suwisa, Sofie Van Landeghem, Tomoko Ohta, et al. “Cell Line Name Recognition in Support of the Identification of Synthetic Lethality in Cancer from Text.” Bioinformatics 32.2 (2016): 276–282. Print.
@article{7081746,
  abstract     = {Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. 
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46\% on the test set of Gellus and 85.98\% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.},
  author       = {Kaewphan, Suwisa and Van Landeghem, Sofie and Ohta, Tomoko and Van de Peer, Yves and Ginter, Filip and Pyysalo, Sampo},
  issn         = {1367-4803},
  journal      = {BIOINFORMATICS},
  keyword      = {SHARED TASK},
  language     = {eng},
  number       = {2},
  pages        = {276--282},
  title        = {Cell line name recognition in support of the identification of synthetic lethality in cancer from text},
  url          = {http://dx.doi.org/10.1093/bioinformatics/btv570},
  volume       = {32},
  year         = {2016},
}
