Information Resources Management Association

Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Author(s): Congfeng Jiang (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Junming Liu (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Dongyang Ou (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Yumei Wang (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China) and Lifeng Yu (Hithink RoyalFlush Information Network Co., Ltd., Hangzhou, China)
Copyright: 2018
Volume: 29
Issue: 2
Pages: 22
Source title: Journal of Database Management (JDM)
Editor(s)-in-Chief: Keng Siau (City University of Hong Kong, Hong Kong SAR)
DOI: 10.4018/JDM.2018040101

Abstract

The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors use implicit formatting semantics, such as changes of formatting, formatting templates and implications, and explicit formatting layouts, as well as a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their approach to metadata extraction. They tested 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
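
The idea of letting a change of formatting mark a metadata boundary can be illustrated with a small sketch. This is not the PAXAT pipeline, whose details the abstract does not give. It assumes each text line has already been paired with a font name and font size (values a preprocessing tool such as PDFBox could supply), and the names `Line`, `segment_by_format`, and `guess_title`, the largest-font title heuristic, and the sample font values are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One text line plus its formatting (hypothetical structure)."""
    text: str
    font: str
    size: float


def segment_by_format(lines):
    """Group consecutive lines that share the same (font, size) pair.

    A change in either value starts a new block, which is treated here
    as a candidate boundary between metadata fields.
    """
    blocks = []
    for (font, size), group in groupby(lines, key=lambda ln: (ln.font, ln.size)):
        text = " ".join(ln.text for ln in group)
        blocks.append({"font": font, "size": size, "text": text})
    return blocks


def guess_title(blocks):
    """Assumed heuristic: the title is the first block in the largest font."""
    return max(blocks, key=lambda b: b["size"])["text"]


# Illustrative first-page lines; the font names and sizes are made up.
lines = [
    Line("Implicit Semantics Based Metadata Extraction", "Times-Bold", 18.0),
    Line("and Matching of Scholarly Documents", "Times-Bold", 18.0),
    Line("Congfeng Jiang, Junming Liu", "Times-Roman", 12.0),
    Line("Hangzhou Dianzi University", "Times-Italic", 10.0),
]

blocks = segment_by_format(lines)
title = guess_title(blocks)
```

Here the two title lines merge into one block because their formatting is identical, while the author and affiliation lines fall into separate blocks. A real extractor would need the additional cues the abstract lists, such as templates and keyword lookups, since font size alone is not a reliable signal.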
