Specific Challenges of Variation and Text Types
Document Type
Article
Publication Title
Synthesis Lectures on Human Language Technologies
Abstract
One fascinating aspect of language identification, and one that makes it difficult, is the similarity between languages. Some languages are extremely easy to distinguish from each other, whereas others are extremely difficult to tell apart. This phenomenon is closely tied to the definition of “language”, which is much less trivial than one might think. It is hard to draw the line between languages and dialects. For example, mutual intelligibility is one of the criteria often mentioned, but it is highly subjective and very difficult to measure objectively. Several organizations maintain lists of languages. Ethnologue: Languages of the World is currently in its 25th edition and lists 7,168 known living languages. It is published by SIL International, which is also responsible for the ISO 639-3 standard of three-letter codes representing individual languages. The Library of Congress is the registration authority for the ISO 639-2 standard, which consists of ISO 639-3 compatible three-letter codes for a considerably smaller number of languages and is also continuously updated. Glottolog, published by the Max Planck Institute, lists 8,572 entries in its version 4.7. Volume two of the Linguasphere Register includes over 30,000 languages and dialects. Of these lists, ISO 639-3 and its subset ISO 639-2 are the most widely used, although the two-letter codes from ISO 639-1 are still in use on many occasions.
First Page
99
Last Page
115
DOI
10.1007/978-3-031-45822-4_4
Publication Date
1-2-2024
Keywords
Language identification, Library of Congress, Max Planck Institute, Registration Authority
Recommended Citation
T. Jauhiainen et al., "Specific Challenges of Variation and Text Types," Synthesis Lectures on Human Language Technologies, vol. Part F2039, pp. 99-115, Jan 2024.
The definitive version is available at https://doi.org/10.1007/978-3-031-45822-4_4