UTIL: Unified Transliteration of Indic Languages
NOTE: A PDF of this article is available for download. You need good Unicode fonts to read the article; I recommend the Noto font family from Google, which can be installed with apt-get. IPA symbols are written within square brackets, like [ʂ], and transliterated symbols within slashes, like /ṭ/.
UTIL is a romanization scheme for Indic languages, designed as a pan-Indian transliteration scheme. It covers 20+ languages: Bengali, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Lepcha, Limbu, Manipuri (Meitei), Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Sinhala, Tamil, Telugu, Urdu and probably many more.
So, why yet another scheme?
- IAST is insufficient. It serves Sanskrit and Pali but is incomplete for pretty much everything else (e.g. Bengali, Gujarati, etc.).
- ISO 15919 is also insufficient. It ignores Kashmiri and Sindhi, which are integral Indian languages. It also lacks symbols for newly assigned Unicode codepoints (e.g. ॹ or ॺ). In addition, कृष्ण /Kṛṣṇa/ is typographically more consistent than /Kr̥ṣṇa/.
- ALA-LC is designed as a set of single-language models, ignoring the inherent similarity of Brahmic scripts. This leads to inconsistencies. For example, Tamil ழ /ḻ/ and Kannada ೞ /l̤/ correspond to the same character and sound (“retroflex approximant”) and yet have different representations. Conversely, the same symbol /ṣ/ represents Hindi ष [ʂ] and Urdu ص [sˤ] even though they are completely different sounds.
- Other schemes like Hunterian or Gretil are as bad as the above or even worse sometimes.
So, how is UTIL better?
- Covers the entire character set of ISO 15919 plus more (Kashmiri, Sindhi)
- Long vowels always have macron above (ā, ē, …)
- Aspirated consonants always have ‘h’ as second letter (kh, gh, …)
- Minimum number of diacritical marks:
  - Only four diacritics are used: “dot above”, “dot below”, “macron above”, “macron below” (or their combination).
  - Only three diacritics are needed for Sanskrit, instead of IAST’s five.
- Prefers precomposed characters in the Unicode repertoire, but does not require them.
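The diacritic counts claimed above can be checked mechanically. The following is a minimal sketch using Python's standard `unicodedata` module: it decomposes each letter (NFD) and collects the combining marks that appear. The letter lists are typed by hand from the tables in this article and from standard IAST, and the helper name `diacritics` is made up for this illustration.

```python
import unicodedata

# Letters with diacritics, copied by hand from the UTIL tables in this article.
UTIL_LETTERS = (
    "ā ī ū ē ō ṛ ṝ ḷ ḹ ạ ẹ ọ ụ ṅ n̄ ṭ ḍ ṇ ṡ ṣ ṙ ḻ ẕ "
    "ṉ ḏ ṟ ḇ ġ ċ ẏ ṃ ḥ ẓ ż ṯ ḵ ẖ"
)

# Letters needed for Sanskrit: UTIL versus standard IAST.
UTIL_SANSKRIT = "ā ī ū ṛ ṝ ḷ ḹ ṅ n̄ ṭ ḍ ṇ ṡ ṣ ṃ ḥ"
IAST_SANSKRIT = "ā ī ū ṛ ṝ ḷ ḹ ṅ ñ ṭ ḍ ṇ ś ṣ ṃ ḥ"


def diacritics(text):
    """Return the Unicode names of all combining marks used in `text`."""
    decomposed = unicodedata.normalize("NFD", text)
    return {unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)}


print(sorted(diacritics(UTIL_LETTERS)))
print(len(diacritics(UTIL_SANSKRIT)), "marks for Sanskrit in UTIL")
print(len(diacritics(IAST_SANSKRIT)), "marks for Sanskrit in IAST")
```

Because the text is decomposed first, the check gives the same answer whether a letter was entered as a precomposed character or as a base letter followed by combining marks.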
Vowels
Primary vowels and diphthongs:
अ | आ | इ | ई | उ | ऊ | ऋ | ॠ | ऌ | ॡ | ऎ | ए | ऐ | ऒ | ओ | औ |
a | ā | i | ī | u | ū | ṛ | ṝ | ḷ | ḹ | e | ē | ai | o | ō | au |
Additional ones, all have a dot below:
ॳ | ॴ | ऍ | ॲ | ऑ | ॶ | ॷ |
ạ | ạ̄ | ẹ | ọ | ọ̄ | ụ | ụ̄ |
Consonants
Consonants with their Sanskrit names:
Place | Plosives (स्पर्श) | Nasal (नासिक) | Implosives (विस्पर्श) | Fricatives (ऊष्मन्) | Vibrants (द्रव) | Approximants (अन्तस्थ) |
---|---|---|---|---|---|---|
कण्ठ्य Velar | क k, ख kh, ग g, घ gh | ङ ṅ | ॻ g̱ | ख़ ḵẖ, ग़ ġ, ह h | | |
तालव्य Palatal | च c, छ ch, ज j, झ jh | ञ n̄ | ॼ j̱ | श ṡ, झ़ zh | | य y, य़ ẏ |
मूर्धन्य Retroflex | ट ṭ, ठ ṭh, ड ḍ, ढ ḍh | ण ṇ | | ष ṣ | ड़ ṙ, ढ़ ṙh | ळ ḻ, ऴ ẕ |
दन्त्य Dental | त t, थ th, द d, ध dh | न n | | स़ s̱ | | ल l |
वर्त्स्य Alveolar | च़ ċ, छ़ ċh | ऩ ṉ | ॾ ḏ | स s, ज़ z | र r, ऱ ṟ | |
ओष्ठ्य Labial | प p, फ ph, ब b, भ bh | म m | ॿ ḇ | फ़ f | | व v, व़ w |
Affricate glide ॺ (‘JJYA’) is transcribed /j̄/.
Other symbols
Anusvāra: ṃ | Anunāsika: ̐ | Avagraha: ’ |
Visarga: ḥ | Jihvāmūlīya: x̣ | Upadhmānīya: ẋ |
Vedic Udātta: ́ | Svarita (independent): ̀ | Anudātta: ̱ |
Arabic hamza ء: ʼ | Arabic ain ع: ʽ | |
Rising tone: ˊ | Falling tone: ˋ | Neutral tone: ˙ |
Udātta and svarita use the combining acute and grave accents respectively, whereas hamza and ain use the non-combining modifier letters U+02BC and U+02BD respectively. Tone modifiers are used in Maithili, Dogri and other Pahari languages.
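The difference between combining marks and modifier letters matters in practice: a combining mark attaches to the preceding letter, while a modifier letter is a character of its own. A small sketch with Python's `unicodedata` module makes this visible. It assumes the accents in the table above are U+0301 (acute) and U+0300 (grave); the dictionary and function names are made up for this illustration.

```python
import unicodedata

# Code points shown in the table above or named in the text.
ACCENTS = {"udatta": "\u0301", "svarita": "\u0300"}  # combining acute, grave
MODIFIERS = {"hamza": "\u02BC", "ain": "\u02BD"}     # spacing modifier letters


def describe(ch):
    """Format a character as 'U+XXXX NAME (general category)'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for label, ch in {**ACCENTS, **MODIFIERS}.items():
    print(f"{label}: {describe(ch)}")

# A combining accent merges with its base letter under NFC normalization,
# whereas a modifier letter always remains a separate code point.
with_udatta = unicodedata.normalize("NFC", "a" + ACCENTS["udatta"])
with_hamza = unicodedata.normalize("NFC", "a" + MODIFIERS["hamza"])
print(len(with_udatta), len(with_hamza))
```

The combining accents have general category `Mn` (nonspacing mark) and the modifier letters have `Lm` (modifier letter), which is why text processing tools treat the latter as ordinary letters.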
General Notes
- Anunāsika is denoted by a combining candrabindu. Note the difference between हंस (swan) /haṃsa/ and हँस (laugh) /ha̐s/. In a digraph, the diacritic is placed only on the second letter. Example: हैँ /hai̐/
- A colon is used to denote vowel hiatus or resolve ambiguity. Example: बई /ba:i/ (not /bai/)
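To show how the vowel and consonant tables combine with the inherent vowel, here is a minimal sketch of a Devanagari to UTIL converter in Python. It is an illustration under stated assumptions, not a reference implementation: the function name `to_util` is made up, only the basic Sanskrit/Hindi letters are covered, and it handles neither nukta letters, nor schwa deletion (so हँस comes out as /ha̐sa/ rather than /ha̐s/), nor the hiatus colon.

```python
import unicodedata

VIRAMA = "\u094D"   # suppresses the inherent vowel of a consonant
INHERENT = "a"      # vowel implied by a bare consonant

VOWELS = {"अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū",
          "ऋ": "ṛ", "ए": "ē", "ऐ": "ai", "ओ": "ō", "औ": "au"}

# Dependent vowel signs (matras), written as escapes because they combine.
MATRAS = {"\u093E": "ā", "\u093F": "i", "\u0940": "ī", "\u0941": "u",
          "\u0942": "ū", "\u0943": "ṛ", "\u0947": "ē", "\u0948": "ai",
          "\u094B": "ō", "\u094C": "au"}

CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "n̄",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "ṡ", "ष": "ṣ", "स": "s", "ह": "h",
}

# Anusvara, visarga, and candrabindu (mapped to combining U+0310).
SIGNS = {"\u0902": "ṃ", "\u0903": "ḥ", "\u0901": "\u0310"}


def to_util(text):
    """Transliterate basic Devanagari text into UTIL (sketch only)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in MATRAS:        # explicit vowel sign replaces inherent 'a'
                out.append(MATRAS[nxt])
                i += 1
            elif nxt == VIRAMA:      # dead consonant: no vowel at all
                i += 1
            else:
                out.append(INHERENT)
        else:
            # Independent vowels and signs; anything else passes through.
            out.append(VOWELS.get(ch) or SIGNS.get(ch) or ch)
        i += 1
    return unicodedata.normalize("NFC", "".join(out))


for word in ["कृष्ण", "हंस", "हैँ"]:
    print(word, "->", to_util(word))
```

Since the candrabindu is emitted after the complete vowel, it lands on the second letter of a digraph such as /ai/ without any special handling.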
Script Notes
- ॠ /ṝ/, ऌ /ḷ/ and ॡ /ḹ/ are used only in Sanskrit.
- ऎ /e/ = short ए in Southern scripts (எ, എ, ಎ, ఎ)
- ऒ /o/ = short ओ in Southern scripts (ஒ, ഒ, ಒ, ఒ)
- ऍ /ẹ/ = Gujarati ઍ, Sinhala ඇ, pronounced [æ] as in “bat”
- ऑ /ọ̄/ = Gujarati ઑ, pronounced [ɔː] as in “ball”
- ड़ /ṙa/ = Bengali ড়, Punjabi ੜ, Oriya ଡ଼ (RRA, “retroflex flap”)
- ढ़ /ṙha/ = Bengali ঢ়, Oriya ଢ଼ (RHA, “aspirated retroflex flap”)
- ळ /ḻa/ = used in Marathi, Tamil ள, Malayalam ള, Kannada ಳ, Telugu ళ (LLA, “retroflex lateral approximant”)
- ऴ /ẕa/ = Tamil ழ, Malayalam ഴ, Kannada ೞ, Telugu ఴ (LLLA, “retroflex approximant” = zha)
- ऩ /ṉa/ = Tamil ன, Kannada ನ಼, Malayalam ഩ (NNNA, “alveolar n”)
- ऱ /ṟa/ = Tamil ற, Malayalam റ, Kannada ಱ, Telugu ఱ (RRA, “alveolar r”)
- र् /ṟ/ = repha in Marathi
- य़ /ẏa/ = য in Bengali and Oriya, while य /ya/ = য়
- व़ /wa/ = Urdu و, Assamese ৱ, Oriya ୱ
- ख़ /ḵẖa/ = used in Urdu, Punjabi ਖ਼
- ग़ /ġa/ = used in Urdu, Punjabi ਗ਼
- च़ /ċa/ = used in Kashmiri, Telugu ౘ
- ज़ /za/ = Urdu ز, Gurmukhi ਜ਼, Bengali জ়, Kannada ಜ಼, Telugu ౙ [d͡z]
- ॹ, झ़ /zh/ = Urdu ژ, Gujarati ૹ, Avestan uses ॹ
- Kashmiri vowels ạ, ạ̄, ọ, ọ̄, ụ, ụ̄ are pronounced [ə], [əː], [ɔ], [ɔː], [ɨ], [ɨː]. Another vowel form, ॵ, is sometimes used for [ɔ] (and [ɔː] is skipped altogether). These symbols are taken as-is from ALA-LC and follow Wikipedia. Kashmiri consonants: च़ [t͡s], छ़ [t͡sʰ] and ज़ [z]
- Sindhi implosives: ॻ /g̱/, ॼ /j̱/, ॾ /ḏ/ and ॿ /ḇ/
- Sinhalese nasals: ඟ /ṇ̄ga/, ඥ /ṇ̄ja/, ඬ /ṇ̄ḍa/, ඳ /ṇ̄da/ and ඹ /ṃ̄ba/
- Sinhala long vowel ඈ and Devanagari vowel sign candra long E (U+0955), used in Avestan, are transliterated /ẹ̄/
- The six Malayalam chillu characters represent dead consonants (without the implicit vowel). As such, they are simply transliterated without adding an a after the consonant. Hence, ൿ, ൽ, ൾ, ൻ, ൺ and ർ are respectively transliterated as /k-/, /l-/, /ll-/, /n-/, /nn-/ and /rr-/.
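The chillu rule above amounts to a small lookup table. The sketch below writes it out in Python with explicit code points (U+0D7A to U+0D7F) and cross-checks it against the Unicode character names, whose last word (K, L, LL, N, NN, RR) happens to match the romanization given above. The table name and the cross-check are illustrative assumptions, not part of the scheme.

```python
import unicodedata

# Chillu code points mapped to the transliterations listed above.
CHILLU_TO_UTIL = {
    "\u0D7F": "k-",   # CHILLU K
    "\u0D7D": "l-",   # CHILLU L
    "\u0D7E": "ll-",  # CHILLU LL
    "\u0D7B": "n-",   # CHILLU N
    "\u0D7A": "nn-",  # CHILLU NN
    "\u0D7C": "rr-",  # CHILLU RR
}

for ch, roman in CHILLU_TO_UTIL.items():
    name = unicodedata.name(ch)              # e.g. "MALAYALAM LETTER CHILLU NN"
    derived = name.split()[-1].lower() + "-"
    print(f"U+{ord(ch):04X} {name}: /{roman}/ (matches name: {derived == roman})")
```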
Urdu
Symbols for Perso-Arabic characters are chosen so that they do not conflict with those used for the Brahmic scripts. Urdu introduces six sounds [f, z, ʒ, q, x, ɣ] on top of Hindi (see Hindustani phonology). Note that [f, z, x, ɣ] are fricatives, just like ष [ʂ], स [s], ह [h]. Apart from these six, the IPA values in the table below are indicative only.
Urdu | ق | ح | خ | ء | ع | غ | ط | ظ | ز | ذ | ض | ص | ث | ش | ژ | ف | و |
UTIL | q | ẖ | ḵẖ | ʼ | ʽ | ġ | ṯ | ẓ | z | z̄ | ż | s̄ | s̱ | ṡ | zh | f | w |
IPA | [q] | [ɦ] | [x] | [ʔ] | [ʕ] | [ɣ] | [tˤ] | [zˤ] | [z] | [ð] | [dˤ] | [sˤ] | [θ] | [ʃ] | [ʒ] | [f] | [w] |
Devanagari | क़ | ह़ | ख़ | ॽ | ॽ़ | ग़ | त़ | ज़ | ज़ | ज़ | ज़ | स़ | स़ | श | झ़ | फ़ | व़ |
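The Urdu row of the table can be written as a lookup table and checked for collisions. The sketch below is in Python and uses escapes for the Arabic letters to avoid bidirectional display issues in source code. It only verifies that the 17 UTIL symbols are pairwise distinct and that ص does not reuse the /ṣ/ assigned to ष; the names `URDU_TO_UTIL` and `PAIRS` are made up for this illustration.

```python
# (Arabic letter, UTIL symbol) pairs in the order of the table above.
PAIRS = [
    ("\u0642", "q"),       # qaf
    ("\u062D", "ẖ"),       # hah
    ("\u062E", "ḵẖ"),      # khah
    ("\u0621", "\u02BC"),  # hamza -> modifier letter apostrophe
    ("\u0639", "\u02BD"),  # ain -> modifier letter reversed comma
    ("\u063A", "ġ"),       # ghain
    ("\u0637", "ṯ"),       # tah
    ("\u0638", "ẓ"),       # zah
    ("\u0632", "z"),       # zain
    ("\u0630", "z̄"),       # thal
    ("\u0636", "ż"),       # dad
    ("\u0635", "s̄"),       # sad
    ("\u062B", "s̱"),       # theh
    ("\u0634", "ṡ"),       # sheen
    ("\u0698", "zh"),      # jeh
    ("\u0641", "f"),       # feh
    ("\u0648", "w"),       # waw
]

URDU_TO_UTIL = dict(PAIRS)

# Every Urdu letter in the table gets its own symbol.
symbols = list(URDU_TO_UTIL.values())
print(len(symbols), "letters,", len(set(symbols)), "distinct symbols")

# Unlike ALA-LC, sad does not share /ṣ/ with retroflex ष.
print(URDU_TO_UTIL["\u0635"] != "ṣ")
```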
Input methods for IMEs
Of course, a transliteration scheme is not so useful if it cannot be entered into a computer, for which Input Method Editors (IMEs) are used. This can be thought of as an ASCII transliteration of UTIL.
An Emacs input method can be found in indic-roman-postfix.el; it is a postfix input method (i.e., diacritics are entered after the character).
indic-util.mim is an m17n input method that can be used with many IMEs based on libm17n, such as IBus, uim and fcitx.