Mention-level gene normalization on multi-species and multiple identifiers

Document-level gene normalization (GN), which produces gene identifiers for an input document, helps database curators search for relevant articles using genes of interest as a query. Recent advances in automatic information extraction from the biological literature call for mention-level GN systems, which find the gene identifiers relevant to each gene mention, since a piece of extracted information is likely to be relevant to only some, not all, of the genes mentioned in a given article. However, apart from early GN research that evaluated systems at the mention level, there have been no studies on mention-level GN. In this thesis, we argue for studying gene normalization specifically at the mention level. To this end, we constructed mention-level annotations and describe the annotation process in detail. We then analyzed the characteristics of the resulting dataset and found that many gene mentions refer to multiple gene identifiers rather than a single one. We concluded that such mentions are a distinctive feature of mention-level GN and propose two methods for handling them: a rule-based method and a machine-learning method. The rule-based method divides mentions with multiple gene identifiers into four cases (homologous genes, family genes, coordination genes, and combinations of the three), recognizes each case from its mention string, and assigns identifiers according to the case. The machine-learning method is trained on several features of mentions with multiple gene identifiers and classifies each candidate gene identifier according to whether it belongs to the gene mention. The evaluation results show that both methods improve the performance of baseline systems to a meaningful degree, and that the machine-learning method performs better.
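As an illustration of the rule-based step described in the abstract, the following is a minimal sketch of how a mention string might be sorted into the four cases. The abstract does not give the actual rules, so the species list, the cue words, the regular expressions, and the function name `classify_mention` are all assumptions made for this example and not the thesis's method.

```python
import re

# Hypothetical surface cues; the thesis's actual rules are not given in the abstract.
SPECIES = ("human", "mouse", "rat", "yeast")
SPECIES_RE = re.compile(r"\b(?:%s)\b" % "|".join(SPECIES), re.IGNORECASE)
# A species word plus an optional connector that follows it ("human and ").
SPECIES_COORD_RE = re.compile(
    r"\b(?:%s)\b(?:\s*(?:and|or|,|/)\s*)?" % "|".join(SPECIES), re.IGNORECASE
)
COORD_RE = re.compile(r"\b(?:and|or)\b|[,/]", re.IGNORECASE)
FAMILY_RE = re.compile(r"\b(?:family|families|isoforms|members)\b", re.IGNORECASE)


def classify_mention(mention):
    """Return 'single', 'homologous', 'family', 'coordination', or 'combination'."""
    cases = []

    # Assumed cue: two or more species names suggest homologous genes.
    species_found = {m.lower() for m in SPECIES_RE.findall(mention)}
    if len(species_found) >= 2:
        cases.append("homologous")

    # Assumed cue: family-like words suggest a gene family.
    if FAMILY_RE.search(mention):
        cases.append("family")

    # Drop coordinated species words first, so that "human and mouse p53"
    # is not also counted as a coordination of gene names.
    without_species = SPECIES_COORD_RE.sub("", mention)
    if COORD_RE.search(without_species):
        cases.append("coordination")

    if not cases:
        return "single"
    if len(cases) == 1:
        return cases[0]
    return "combination"
```

Under these assumed cues, a call such as `classify_mention("human and mouse BRCA1 and BRCA2")` would fall into the combination case, because both a species coordination and a gene-name coordination are present. A real system would then map each case to its candidate identifiers, which this sketch does not attempt.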
Advisors
Park, Jong-Cheol (박종철)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Department of Computer Science
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2014
Identifier
569324/325007  / 020123157
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : Department of Computer Science, 2014.2, [ iv, 35 p. ]

Keywords

Annotation; BioCreative; Gene Normalization; Corpus

URI
http://hdl.handle.net/10203/196896
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=569324&flag=dissertation
Appears in Collection
CS-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
