Large Scale, Multi-domain Language Identification

Document Type

Article

Publication Title

Synthesis Lectures on Human Language Technologies

Abstract

In general, the more languages a language identifier must distinguish, the harder the identification task becomes (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). Intuitively, adding classes makes classification more difficult. However, the observed effect depends in part on the evaluation measure used. For example, if accuracy is averaged over all languages, the average may improve when easily distinguishable languages are added to the language repertoire. Brown (2014) presents results in which the average accuracy is higher for 1366 languages than for a subset of 781 languages. He attributes this to the fact that a larger proportion of the languages in the smaller repertoire are represented by Wikipedia texts, which are often multilingual and contain a lot of text in languages other than the intended one. Most language identification research has focused on a relatively small number of languages. In Table 5.1, we list references that have empirically tested language identifiers with 100 or more languages.
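
The effect of the evaluation measure can be illustrated with a minimal sketch. The language names and accuracy figures below are hypothetical and are not taken from Brown (2014) or any other cited work; the helper `macro_average_accuracy` is likewise an illustrative name. The sketch also assumes, for simplicity, that adding languages leaves the accuracy of the existing ones unchanged, which will not hold in general.

```python
from typing import Dict


def macro_average_accuracy(per_language_accuracy: Dict[str, float]) -> float:
    """Return the unweighted mean of per-language accuracies."""
    if not per_language_accuracy:
        raise ValueError("at least one language is required")
    return sum(per_language_accuracy.values()) / len(per_language_accuracy)


# Hypothetical accuracies for three closely related, easily confused languages.
small_repertoire = {"lang_a": 0.80, "lang_b": 0.85, "lang_c": 0.90}

# The same three languages plus two hypothetical, easily distinguishable ones.
# Simplifying assumption: the original three accuracies stay unchanged.
large_repertoire = dict(small_repertoire, lang_d=0.99, lang_e=0.99)

small_avg = macro_average_accuracy(small_repertoire)
large_avg = macro_average_accuracy(large_repertoire)

print(f"{len(small_repertoire)} languages: {small_avg:.3f}")
print(f"{len(large_repertoire)} languages: {large_avg:.3f}")
```

With these made-up figures, the average over five languages is higher than the average over three, even though the five-way task has more classes. The average rises only because the added languages score above the previous mean, not because the original languages became easier to identify.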

First Page

117

Last Page

135

DOI

10.1007/978-3-031-45822-4_5

Publication Date

1-2-2024

Keywords

Artificial intelligence, Domain language, Evaluation measures, Language identification, Large-scales, Multi-domains, Recognizable languages, Wikipedia

Comments

IR conditions: non-described
