Exploring the Potential of an Extensible Domain-Specific Web Corpus for “Layfication”: The Case of Cross-Lingual Classification

Author(s): Marina Santini (RISE Research Institutes of Sweden, Sweden) and Min-Chun Shih (Linköping University, Sweden)
Copyright: 2020
Volume: 2
Issue: 1
Pages: 13
Source title: International Journal of Cyber-Physical Systems (IJCPS)
Editor(s)-in-Chief: Amjad Gawanmeh (University of Dubai, United Arab Emirates)
DOI: 10.4018/IJCPS.2020010102

Keywords: Cybernetics / Information Science Reference / Media & Communications / Network Architecture

Abstract

This article presents experiments based on the extensible domain-specific web corpus for “layfication”. The experiments use both the existing layfication corpus (in Swedish and in English) and a new English addition, the NHS-PubMed subcorpus. With this extended corpus, the authors investigate methods for classifying lay and specialized medical sublanguages cross-lingually using small data and noisy web documents. A sublanguage is a language variety used in a specific domain. Here, the authors focus on two medical sublanguages: “patientspeak” (lay) and medical jargon (specialized). Cross-lingual sublanguage classification is still largely underexplored, although it can be crucial in downstream applications for digital health and cyber-physical systems. Classification models are built using small and noisy training sets in Swedish and evaluated on English test sets. The performance of Naive Bayes classifiers (built with stopwords and with Bag-of-Words) is compared with that of convolutional neural network classifiers leveraging MUSE multilingual word embeddings. Results are promising and nuanced, and are proposed as a first baseline for cross-lingual sublanguage classification.
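To make the Bag-of-Words Naive Bayes baseline mentioned in the abstract more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the authors' implementation: the function names (`tokenize`, `train_nb`, `predict_nb`) and the four toy documents are invented for illustration, and the example is monolingual. The abstract does not say how the Swedish-trained models are applied to English text, so that step is not shown.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace (a simplifying assumption;
    # noisy web text would need more careful tokenization).
    return text.lower().split()

def train_nb(docs, labels):
    # Count documents per class and word occurrences per class.
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokenize(text):
            if w not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothed log likelihood.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the layfication corpus.
train_docs = [
    "my tummy hurts and i feel sick",
    "i have a bad headache and feel tired",
    "patient presents with acute abdominal pain",
    "administer analgesic for chronic cephalalgia",
]
train_labels = ["lay", "lay", "specialized", "specialized"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "i feel sick and tired"))
print(predict_nb(model, "patient with acute pain"))
```

In this sketch, tokens unseen during training are simply skipped. That is one common choice, and it matters most when training and test vocabularies overlap only partially.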
