
Exploring the Potential of an Extensible Domain-Specific Web Corpus for “Layfication”: The Case of Cross-Lingual Classification

Author(s): Marina Santini (RISE Research Institutes of Sweden, Sweden) and Min-Chun Shih (Linköping University, Sweden)
Copyright: 2020
Volume: 2
Issue: 1
Pages: 13
Source title: International Journal of Cyber-Physical Systems (IJCPS)
Editor(s)-in-Chief: Amjad Gawanmeh (University of Dubai, United Arab Emirates)
DOI: 10.4018/IJCPS.2020010102



This article presents experiments based on the extensible domain-specific web corpus for “layfication”. The experiments use both the existing layfication corpus (in Swedish and in English) and a new English addition, the NHS-PubMed subcorpus. With this extended corpus, the authors investigate methods for classifying lay and specialized medical sublanguages cross-lingually using small data and noisy web documents. A sublanguage is a language variety used in a specific domain. Here, the authors focus on two medical sublanguages: “patientspeak” (lay) and medical jargon (specialized). Cross-lingual sublanguage classification is still largely underexplored, although it can be crucial in downstream applications for digital health and cyber-physical systems. Classification models are built using small and noisy training sets in Swedish and evaluated on English test sets. The performance of Naive Bayes classifiers (built with stopwords and with Bag-of-Words) is compared with that of convolutional neural network classifiers leveraging MUSE multilingual word embeddings. Results are promising and nuanced, and they are proposed as a first baseline for cross-lingual sublanguage classification.
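As a rough illustration of the train-on-Swedish, test-on-English setup described above, the following minimal sketch trains a multinomial Naive Bayes classifier on Bag-of-Words features. It is not the authors' implementation. The tiny lexicon `SV_TO_EN`, the example sentences, and the helper names are all hypothetical, and mapping Swedish tokens to English through a lookup table is an assumption made here only to place both languages in one feature space. The abstract does not state how the Naive Bayes features were aligned across languages, and the CNN with MUSE embeddings is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon (an assumption for illustration only, not the
# cross-lingual alignment used in the article).
SV_TO_EN = {
    "jag": "i", "har": "have", "ont": "pain", "magen": "stomach", "min": "my",
    "patienten": "patient", "uppvisar": "presents", "akut": "acute",
    "gastrit": "gastritis",
}

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def to_shared_space(tokens, lexicon):
    """Map source-language tokens to the shared vocabulary, dropping unknowns."""
    return [lexicon[t] for t in tokens if t in lexicon]

class NaiveBayes:
    """Multinomial Naive Bayes over token counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented Swedish training sentences with lay / specialized labels.
train = [
    ("jag har ont i magen", "lay"),
    ("min mage gör ont", "lay"),
    ("patienten uppvisar akut gastrit", "specialized"),
    ("akut gastrit hos patienten", "specialized"),
]
docs = [to_shared_space(tokenize(text), SV_TO_EN) for text, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(docs, labels)

# English test sentences are tokenized directly in the shared (English) space.
print(model.predict(tokenize("i have pain in my stomach")))
print(model.predict(tokenize("the patient presents with acute gastritis")))
```

A stopword-based variant could restrict the features to function words before counting. With real data, the noisy web documents and the small training sets mentioned above would make smoothing and vocabulary coverage the main practical concerns.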
