DSpace at KOASAS: An efficient pre-processing method to identify logical components from PDF documents

An efficient pre-processing method to identify logical components from PDF documents

DC Field	Value	Language
dc.contributor.author	Liu, Ying	ko
dc.contributor.author	Bai, Kun	ko
dc.contributor.author	Gao, Liangcai	ko
dc.date.accessioned	2013-03-11T17:47:22Z	-
dc.date.available	2013-03-11T17:47:22Z	-
dc.date.created	2012-02-06	-
dc.date.issued	2011	-
dc.identifier.citation	LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS), v.6634 LNAI, no.PART 1, pp.500 - 511	-
dc.identifier.issn	0302-9743	-
dc.identifier.uri	http://hdl.handle.net/10203/99775	-
dc.description.abstract	With the rapid growth of scientific documents in digital libraries, the search demand for documents, and for specific components within them, has increased dramatically. Accurately detecting component boundaries is vital to all subsequent information extraction and applications. However, document component boundary detection (especially for tables, figures, and equations) is a challenging problem because there are no standardized formats or layouts across diverse documents. This paper presents an efficient document preprocessing technique that improves component boundary detection performance by taking advantage of the nature of document lines. Our method simplifies the component boundary detection problem into a sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. By combining different heuristic rules, we extract multiple components in batch, filtering out most of the noise as early as possible. Our method focuses on an important untagged document format: PDF documents. The experimental results demonstrate the effectiveness of the sparse line analysis. © 2011 Springer-Verlag.	-
dc.language	English	-
dc.publisher	Springer Verlag	-
dc.title	An efficient pre-processing method to identify logical components from PDF documents	-
dc.type	Article	-
dc.identifier.scopusid	2-s2.0-79957930143	-
dc.type.rims	ART	-
dc.citation.volume	6634 LNAI	-
dc.citation.issue	PART 1	-
dc.citation.beginningpage	500	-
dc.citation.endingpage	511	-
dc.citation.publicationname	LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS)	-
dc.contributor.localauthor	Liu, Ying	-
dc.contributor.nonIdAuthor	Bai, Kun	-
dc.contributor.nonIdAuthor	Gao, Liangcai	-
dc.subject.keywordAuthor	Boundary Detection	-
dc.subject.keywordAuthor	PDF documents	-
dc.subject.keywordAuthor	Preprocessing	-
dc.subject.keywordAuthor	Sparse Line Property	-
dc.subject.keywordAuthor	Table and Equation	-

Appears in Collection: KSE-Journal Papers

Files in This Item: There are no files associated with this item.
