Temporal information extraction from Korean texts

Due to the increasing number of unstructured documents available on the Web and from other sources, techniques that automatically extract knowledge from documents have become increasingly important. Among the many aspects of knowledge extraction, the extraction of temporal information has recently drawn much attention, since documents usually contain temporal information that is useful for applications such as Information Retrieval (IR) and Question Answering (QA) systems. Given a simple question such as ``who was the president of the U.S. 8 years ago?'', for example, a QA system may have difficulty finding the right answer without correct temporal information about when the question was posed and what `8 years ago' refers to. Before temporal information can be extracted, a representation scheme or annotation language for temporal information must be defined. The most popular annotation languages are TimeML and ISO-TimeML. Although they are designed to represent various types of temporal information, they do not account for language diversity: because of language-specific characteristics, some languages cannot be properly annotated with TimeML or ISO-TimeML. Korean is one such language, so Korean TimeML (KTimeML) was proposed in 2009. However, KTimeML also has limitations. For example, it does not consider the lunar calendar, although lunar-calendar temporal expressions appear often in Korean texts. It is also based on morpheme-level annotation, which is impractical for data distribution and sharing. In this dissertation, a revised version of KTimeML is proposed, and the Korean TimeBank, constructed using a part of the revised KTimeML, is presented. With the Korean TimeBank, a system for temporal information extraction, named ExoTime, is developed.
Several challenging Korean-specific issues are discussed, along with how the proposed system addresses them. The proposed system makes use of a Korean analyzer, which provides POS tags, NE tags, and dependency parsing results. Because the performance of the Korean analyzer is less stable than that of comparable tools for English, a new method for generating complementary features is also proposed. The complementary feature generation method is a data-driven model designed to be applicable to any language, and it generates syntactic and semantic features in an unsupervised way. The proposed system is expected to benefit industry and various research fields, because documents usually contain temporal information that is useful for a wide range of applications.
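
As a small illustration of why the time at which a question is posed matters, the sketch below resolves the expression `8 years ago' from the example question against two different reference dates. It is only a minimal sketch under stated assumptions: the function name resolve_years_ago and the regular expression are hypothetical, they cover only the English pattern ``N years ago'', and they are not the method used by ExoTime.

```python
import re
from datetime import date


def resolve_years_ago(expression: str, reference: date) -> int:
    """Return the calendar year referred to by an 'N years ago' expression.

    Hypothetical helper for illustration only; it handles nothing but
    the literal English pattern '<digits> year(s) ago'.
    """
    match = re.search(r"(\d+)\s+years?\s+ago", expression)
    if match is None:
        raise ValueError(f"no 'N years ago' pattern in: {expression!r}")
    # The answer depends entirely on the reference (question) time.
    return reference.year - int(match.group(1))


question = "who was the president of the U.S. 8 years ago?"

# The same question resolves to different years for different reference dates.
year_if_asked_in_2016 = resolve_years_ago(question, date(2016, 3, 1))
year_if_asked_in_2024 = resolve_years_ago(question, date(2024, 3, 1))
print(year_if_asked_in_2016, year_if_asked_in_2024)
```

The two calls differ only in the reference date, yet they yield different target years, which is why an extraction system needs both the temporal expression and the time of utterance.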
