DanTok: Domain Beats Language for Danish Social Media POS Tagging

Kia Kirstein Hansen; Maria Barrett; Max Müller-Eberstein; Cathrine Damgaard; Trine Eriksen; Rob van der Goot

Abstract

Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.

Anthology ID: 2023.nodalida-1.27
Volume: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month: May
Year: 2023
Address: Tórshavn, Faroe Islands
Editors: Tanel Alumäe, Mark Fishel
Venue: NoDaLiDa
Publisher: University of Tartu Library
Pages: 271–279
URL: https://aclanthology.org/2023.nodalida-1.27
Cite (ACL): Kia Kirstein Hansen, Maria Barrett, Max Müller-Eberstein, Cathrine Damgaard, Trine Eriksen, and Rob van der Goot. 2023. DanTok: Domain Beats Language for Danish Social Media POS Tagging. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 271–279, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal): DanTok: Domain Beats Language for Danish Social Media POS Tagging (Kirstein Hansen et al., NoDaLiDa 2023)
PDF: https://aclanthology.org/2023.nodalida-1.27.pdf
