Large Scale, Multi-domain Language Identification
Document Type
Article
Publication Title
Synthesis Lectures on Human Language Technologies
Abstract
In general, the more languages a language identifier has to distinguish, the more difficult it is to identify the language of a text (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). Intuitively, adding classes makes a classification task harder. However, the observed effect depends in part on the evaluation measure used. For example, if accuracy is averaged over all languages, the average may improve when easily distinguishable languages are added to the language selection. Brown (2014) presents results in which the average accuracy is higher for 1366 languages than for a subset of 781 languages. He explains this by the fact that a larger proportion of the languages in the smaller set are based on Wikipedia texts, which are often multilingual and contain a lot of text in languages other than the intended one. Most language identification research has focused on a relatively small number of languages. In Table 5.1, we list references that have empirically tested language identifiers with 100 or more languages.
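The effect of the evaluation measure described in the abstract can be illustrated with a minimal sketch. The per-language accuracies below are hypothetical values invented for illustration and are not taken from Brown (2014) or any other cited work; the function name `average_accuracy` is likewise an assumption.

```python
def average_accuracy(per_language_accuracy):
    """Unweighted mean of per-language accuracies."""
    return sum(per_language_accuracy.values()) / len(per_language_accuracy)


# Hypothetical accuracies for a small language selection.
small_set = {"lang_a": 0.90, "lang_b": 0.80}

# Hypothetical larger selection: the original languages get slightly
# harder to identify, but the two added languages are easy to distinguish.
large_set = {"lang_a": 0.89, "lang_b": 0.79, "lang_c": 0.99, "lang_d": 0.99}

small_avg = average_accuracy(small_set)
large_avg = average_accuracy(large_set)

print(f"average accuracy, 2 languages: {small_avg:.3f}")
print(f"average accuracy, 4 languages: {large_avg:.3f}")
```

In this sketch every original language is identified less accurately in the larger selection, yet the average over all languages rises, because the added languages pull the mean upward.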
First Page
117
Last Page
135
DOI
10.1007/978-3-031-45822-4_5
Publication Date
January 2, 2024
Keywords
Artificial intelligence, Domain language, Evaluation measures, Language identification, Large-scales, Multi-domains, Recognizable languages, Wikipedia
Recommended Citation
T. Jauhiainen et al., "Large Scale, Multi-domain Language Identification," Synthesis Lectures on Human Language Technologies, vol. Part F2039, pp. 117 - 135, Jan 2024.
The definitive version is available at https://doi.org/10.1007/978-3-031-45822-4_5