Specific Challenges of Variation and Text Types

Publication Title

Synthesis Lectures on Human Language Technologies


One aspect of language identification that makes it difficult is the similarity between languages. Some languages are extremely easy to distinguish from each other, whereas others are extremely difficult to tell apart. This phenomenon is closely tied to the definition of “language”, which is far less trivial than one might think: it is hard to draw the line between languages and dialects. Mutual intelligibility, for example, is an often-mentioned criterion, but it is highly subjective and very difficult to measure objectively. Several organizations maintain lists of languages. Ethnologue: Languages of the World is currently in its 25th edition and lists 7,168 known living languages. It is published by SIL International, which is also responsible for the ISO 639-3 standard of three-letter codes representing individual languages. The Library of Congress is the registration authority for the ISO 639-2 standard, which consists of ISO 639-3-compatible three-letter codes for a considerably smaller number of languages and is likewise continuously updated. Glottolog, published by the Max Planck Institute, lists 8,572 entries in its version 4.7. Volume two of the Linguasphere Register includes over 30,000 languages and dialects. Of these lists, ISO 639-3 and its subset ISO 639-2 are the most widely used, although the two-letter codes of ISO 639-1 are still in use on many occasions.

Language identification, Library of Congress, Max Planck Institute, Registration Authority

