An efficient pre-processing method to identify logical components from PDF documents

DC Field | Value | Language
dc.contributor.author | Liu, Ying | ko
dc.contributor.author | Bai, Kun | ko
dc.contributor.author | Gao, Liangcai | ko
dc.date.accessioned | 2013-03-11T17:47:22Z | -
dc.date.available | 2013-03-11T17:47:22Z | -
dc.date.created | 2012-02-06 | -
dc.date.issued | 2011 | -
dc.identifier.citation | LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS), v.6634 LNAI, no.PART 1, pp.500 - 511 | -
dc.identifier.issn | 0302-9743 | -
dc.identifier.uri | http://hdl.handle.net/10203/99775 | -
dc.description.abstract | With the rapid growth of scientific documents in digital libraries, the demand for searching both documents and their specific components has increased dramatically. Accurately detecting component boundaries is vital to all further information extraction and applications. However, document component boundary detection (especially for tables, figures, and equations) is a challenging problem because there are no standardized formats and layouts across diverse documents. This paper presents an efficient document preprocessing technique that improves document component boundary detection performance by taking advantage of the nature of document lines. Our method simplifies the component boundary detection problem into a sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. Combined with different heuristic rules, the method extracts multiple components in a batch by filtering out large amounts of noise as early as possible. Our method focuses on an important untagged document format: PDF documents. The experimental results demonstrate the effectiveness of the sparse line analysis. © 2011 Springer-Verlag. | -
dc.language | English | -
dc.publisher | Springer Verlag | -
dc.title | An efficient pre-processing method to identify logical components from PDF documents | -
dc.type | Article | -
dc.identifier.scopusid | 2-s2.0-79957930143 | -
dc.type.rims | ART | -
dc.citation.volume | 6634 LNAI | -
dc.citation.issue | PART 1 | -
dc.citation.beginningpage | 500 | -
dc.citation.endingpage | 511 | -
dc.citation.publicationname | LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS) | -
dc.contributor.localauthor | Liu, Ying | -
dc.contributor.nonIdAuthor | Bai, Kun | -
dc.contributor.nonIdAuthor | Gao, Liangcai | -
dc.subject.keywordAuthor | Boundary Detection | -
dc.subject.keywordAuthor | PDF documents | -
dc.subject.keywordAuthor | Preprocessing | -
dc.subject.keywordAuthor | Sparse Line Property | -
dc.subject.keywordAuthor | Table and Equation | -
Appears in Collection
KSE-Journal Papers(저널논문)
Files in This Item
There are no files associated with this item.