UTIL: Unified Transliteration of Indic Languages
:NOTE: Download PDF of this article. You need good Unicode fonts to read
the article. I recommend the Noto font family from Google, which you can
install with apt-get. IPA symbols are within square brackets, like [ʂ], and
transliterated symbols are within slashes, like /ṭ/.
UTIL is a romanization scheme for Indic languages, designed as a pan-Indian transliteration scheme. It covers 20+ languages: Bengali, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Lepcha, Limbu, Manipuri (Meitei), Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Sinhala, Tamil, Telugu, Urdu and probably many more.
So, why yet another scheme?
- IAST is insufficient. It serves Sanskrit and Pali but is incomplete for pretty much everything else (e.g. Bengali, Gujarati, etc.).
 - ISO 15919 is also insufficient. It ignores Kashmiri and Sindhi, which are integral Indian languages. It also lacks symbols for newly assigned Unicode code points (e.g. ॹ or ॺ). In addition, कृष्ण /Kṛṣṇa/ is typographically more consistent than /Kr̥ṣṇa/.
 - ALA-LC is designed as a set of single-language models, ignoring the inherent similarity of Brahmic scripts. This leads to inconsistencies. For example, Tamil ழ /ḻ/ and Kannada ೞ /l̤/ correspond to the same character and sound (“retroflex approximant”) and yet have different representations. Conversely, the same symbol /ṣ/ represents Hindi ष [ʂ] and Urdu ص [sˤ] even though they are completely different sounds.
 - Other schemes like Hunterian or GRETIL are as bad as the above, and sometimes worse.
 
So, how is UTIL better?
- Covers the entire character set of ISO 15919 plus more (Kashmiri, Sindhi)
 - Long vowels always have a macron above (ā, ē, …)
 - Aspirated consonants always have ‘h’ as the second letter (kh, gh, …)
 - Minimum number of diacritical marks:
   - Only four diacritics are used: “dot above”, “dot below”, “macron above”, “macron below” (or their combination).
   - Only three diacritics are needed for Sanskrit, instead of IAST’s five.
 - Prefers precomposed characters from the Unicode repertoire, but does not require them.
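
The preference for precomposed characters can be checked with Unicode normalization. The following is a minimal sketch, not part of UTIL itself: it uses Python's standard `unicodedata` module to see whether a base letter plus one of the four diacritics folds into a single code point under NFC. The helper name `describe` and the sample list are illustrative only.

```python
import unicodedata

def describe(symbol: str) -> str:
    """Report whether NFC folds `symbol` into one precomposed code point."""
    nfc = unicodedata.normalize("NFC", symbol)
    kind = "precomposed" if len(nfc) == 1 else "needs combining marks"
    codepoints = " ".join(f"U+{ord(c):04X}" for c in nfc)
    return f"{nfc}: {kind} ({codepoints})"

# Base letter + combining mark, written with explicit escapes.
samples = {
    "a + macron above": "a\u0304",
    "t + dot below": "t\u0323",
    "n + dot above": "n\u0307",
    "g + macron below": "g\u0331",
    "j + macron below": "j\u0331",
}

for label, text in samples.items():
    print(label, "->", describe(text))
```

Symbols such as /g̱/ and /j̱/ have no precomposed form in Unicode, which is why the scheme cannot require precomposed characters everywhere.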
 
Vowels
Primary vowels and diphthongs:
| अ | आ | इ | ई | उ | ऊ | ऋ | ॠ | ऌ | ॡ | ऎ | ए | ऐ | ऒ | ओ | औ | 
| a | ā | i | ī | u | ū | ṛ | ṝ | ḷ | ḹ | e | ē | ai | o | ō | au | 
Additional ones, all have a dot below:
| ॳ | ॴ | ऍ | ॲ | ऑ | ॶ | ॷ | 
| ạ | ạ̄ | ẹ | ọ | ọ̄ | ụ | ụ̄ | 
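
The two vowel tables can be written as a lookup table, which also makes the design rules testable. This is a sketch only: the dictionary copies the tables above, while the helper names and the grouping into `LONG`, `SHORT` and `ADDITIONAL` are my own.

```python
import unicodedata

MACRON_ABOVE = "\u0304"
DOT_BELOW = "\u0323"

# Independent Devanagari vowels and their UTIL symbols, copied from the tables above.
VOWELS = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū",
    "ऋ": "ṛ", "ॠ": "ṝ", "ऌ": "ḷ", "ॡ": "ḹ",
    "ऎ": "e", "ए": "ē", "ऐ": "ai", "ऒ": "o", "ओ": "ō", "औ": "au",
    "ॳ": "ạ", "ॴ": "ạ̄", "ऍ": "ẹ", "ॲ": "ọ", "ऑ": "ọ̄", "ॶ": "ụ", "ॷ": "ụ̄",
}

LONG = ["आ", "ई", "ऊ", "ॠ", "ॡ", "ए", "ओ"]
SHORT = ["अ", "इ", "उ", "ऋ", "ऌ", "ऎ", "ऒ"]
ADDITIONAL = ["ॳ", "ॴ", "ऍ", "ॲ", "ऑ", "ॶ", "ॷ"]

def has_mark(symbol: str, mark: str) -> bool:
    """True if `symbol` carries `mark` once fully decomposed (NFD)."""
    return mark in unicodedata.normalize("NFD", symbol)

# Long vowels carry a macron above, short ones do not.
print(all(has_mark(VOWELS[v], MACRON_ABOVE) for v in LONG))
print(not any(has_mark(VOWELS[v], MACRON_ABOVE) for v in SHORT))
# The additional vowels all carry a dot below.
print(all(has_mark(VOWELS[v], DOT_BELOW) for v in ADDITIONAL))
```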
Consonants
Consonants with their Sanskrit names:
| | Plosives | Nasal | Implosives | Fricatives | Vibrants | Approximants | 
|---|---|---|---|---|---|---|
| | स्पर्श | नासिक | विस्पर्श | ऊष्मन् | द्रव | अन्तस्थ | 
| कण्ठ्य Velar | क k, ख kh, ग g, घ gh | ङ ṅ | ॻ g̱ | ख़ ḵẖ, ग़ ġ, ह h | | | 
| तालव्य Palatal | च c, छ ch, ज j, झ jh | ञ n̄ | ॼ j̱ | श ṡ, झ़ zh | | य y, य़ ẏ | 
| मूर्धन्य Retroflex | ट ṭ, ठ ṭh, ड ḍ, ढ ḍh | ण ṇ | | ष ṣ | ड़ ṙ, ढ़ ṙh | ळ ḻ, ऴ ẕ | 
| दन्त्य Dental | त t, थ th, द d, ध dh | न n | | स़ s̱ | | ल l | 
| वर्त्स्य Alveolar | च़ ċ, छ़ ċh | ऩ ṉ | ॾ ḏ | स s, ज़ z | र r, ऱ ṟ | | 
| ओष्ठ्य Labial | प p, फ ph, ब b, भ bh | म m | ॿ ḇ | फ़ f | | व v, व़ w | 
Affricate glide ॺ (‘JJYA’) is transcribed /j̄/.
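
To show how the consonant and vowel symbols combine, here is a minimal sketch of a Devanagari-to-UTIL converter. It is not a reference implementation. It covers only a subset of the tables above, it assumes that dependent vowel signs reuse the symbols of the independent vowels, and it always writes the inherent /a/ (so it does not model Hindi schwa deletion). The function name `transliterate` is hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # suppresses the inherent vowel of the preceding consonant

# Subset of the consonant table above.
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "n\u0304",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "ṡ", "ष": "ṣ", "स": "s", "ह": "h",
}

# Assumption: dependent vowel signs take the same symbols as independent vowels.
VOWEL_SIGNS = {
    "\u093E": "ā", "\u093F": "i", "\u0940": "ī", "\u0941": "u", "\u0942": "ū",
    "\u0943": "ṛ", "\u0947": "ē", "\u0948": "ai", "\u094B": "ō", "\u094C": "au",
}

OTHER = {"\u0902": "ṃ", "\u0903": "ḥ"}  # anusvāra, visarga

def transliterate(text: str) -> str:
    out = []
    inherent_a = False  # True while the last consonant still owes its inherent 'a'
    for ch in text:
        if ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # the vowel sign replaces the inherent 'a'
            inherent_a = False
        elif ch == VIRAMA:
            inherent_a = False  # dead consonant, no vowel at all
        else:
            if inherent_a:
                out.append("a")
            out.append(CONSONANTS.get(ch) or OTHER.get(ch, ch))
            inherent_a = ch in CONSONANTS
    if inherent_a:
        out.append("a")
    return unicodedata.normalize("NFC", "".join(out))

print(transliterate("कृष्ण"))
print(transliterate("हंस"))
```

Returning NFC output follows the scheme's stated preference for precomposed characters; characters outside the three tables pass through unchanged.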
Other symbols
| Anusvāra: ṃ | Anunāsika: ̐ | Avagraha: ’ | 
| Visarga: ḥ | Jihvāmūlīya: x̣ | Upadhmānīya: ẋ | 
| Vedic Udātta: ́ | Svarita (independent): ̀ | Anudātta: ̱ | 
| Arabic hamza ء: ʼ | Arabic ain ع: ʽ | |
| Rising tone: ˊ | Falling tone: ˋ | Neutral tone: ˙ | 
Udātta and svarita use the combining acute and grave accents respectively, as shown in the table above, whereas hamza and ain use the non-combining modifier letters U+02BC and U+02BD respectively. Tone modifiers are used in Maithili, Dogri and other Pahari languages.
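
The distinction between combining accents and non-combining modifier letters is visible in the Unicode character database. The sketch below, using only Python's standard `unicodedata` module, is illustrative; the helper `is_combining` is not part of any UTIL tooling.

```python
import unicodedata

ACCENTS = ["\u0301", "\u0300"]    # combining acute, combining grave
MODIFIERS = ["\u02BC", "\u02BD"]  # hamza, ain

def is_combining(ch: str) -> bool:
    """True if `ch` attaches to the preceding letter (non-zero combining class)."""
    return unicodedata.combining(ch) != 0

for ch in ACCENTS + MODIFIERS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "combining" if is_combining(ch) else "stands alone")

# A combining accent merges with its base letter under NFC; a modifier letter does not.
print(len(unicodedata.normalize("NFC", "a\u0301")))
print(len(unicodedata.normalize("NFC", "a\u02BC")))
```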
General Notes
- Anunāsika is denoted by a combining candrabindu. Note the difference between हंस (swan) /haṃsa/ and हँस (laugh) /ha̐s/. In a digraph, the diacritic goes only on the second letter. Example: हैँ /hai̐/
 - A colon is used to denote vowel hiatus or resolve ambiguity. Example: बई /ba:i/ (not /bai/)
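
The two notes above can be sketched as small string helpers. This is a sketch under assumptions: the function names are hypothetical, and the colon is inserted only when the plain join would be misread as one of the diphthongs /ai/ or /au/, which is one possible reading of the rule.

```python
CANDRABINDU = "\u0310"      # combining candrabindu (anunāsika)
DIPHTHONGS = {"ai", "au"}   # digraph vowels from the vowel table

def nasalize(vowel: str) -> str:
    """Mark a vowel as nasal; appending puts the mark on the last letter of a digraph."""
    return vowel + CANDRABINDU

def join_vowels(first: str, second: str) -> str:
    """Join two adjacent vowels, adding a colon if the plain join would read as a diphthong."""
    plain = first + second
    return f"{first}:{second}" if plain in DIPHTHONGS else plain

print("h" + nasalize("ai"))        # candrabindu lands on the 'i' of the digraph
print("b" + join_vowels("a", "i")) # hiatus is kept distinct from the diphthong
```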
 
Script Notes
- ॠ /ṝ/, ऌ /ḷ/ and ॡ /ḹ/ are used only in Sanskrit.
 - ऎ /e/ = short ए in Southern scripts (எ, എ, ಎ, ఎ)
 - ऒ /o/ = short ओ in Southern scripts (ஒ, ഒ, ಒ, ఒ)
 - ऍ /ẹ/ = Gujarati ઍ, Sinhala ඇ, pronounced [æ] as in “bat”
 - ऑ /ọ̄/ = Gujarati ઑ, pronounced [ɔː] as in “ball”
 - ड़ /ṙa/ = Bengali ড়, Punjabi ੜ, Oriya ଡ଼ (RRA, “retroflex flap”)
 - ढ़ /ṙha/ = Bengali ঢ়, Oriya ଢ଼ (RHA, “aspirated retroflex flap”)
 - ळ /ḻa/ = used in Marathi, Tamil ள, Malayalam ള, Kannada ಳ, Telugu ళ (LLA, “retroflex lateral approximant”)
 - ऴ /ẕa/ = Tamil ழ, Malayalam ഴ, Kannada ೞ, Telugu ఴ (LLLA, “retroflex approximant” = zha)
 - ऩ /ṉa/ = Tamil ன , Kannada ನ಼, Malayalam ഩ (NNNA, “alveolar n”)
 - ऱ /ṟa/ = Tamil ற, Malayalam റ, Kannada ಱ, Telugu ఱ (RRA, “alveolar r”)
 - र् /ṟ/ = repha in Marathi
 - य़ /ẏa/ = য in Bengali and Oriya, while य /ya/ = য়
 - व़ /wa/ = Urdu و, Assamese ৱ, Oriya ୱ
 - ख़ /ḵẖa/ = used in Urdu, Punjabi ਖ਼
 - ग़ /ġa/ = used in Urdu, Punjabi ਗ਼
 - च़ /ċa/ = used in Kashmiri, Telugu ౘ
 - ज़ /za/ = Urdu ز, Gurmukhi ਜ਼, Bengali জ়, Kannada ಜ಼, Telugu ౙ [d͡z]
 - ॹ, झ़ /zh/ = Urdu ژ, Gujarati ૹ, Avestan uses ॹ
 - Kashmiri vowels ạ, ạ̄, ọ, ọ̄, ụ, ụ̄ are pronounced [ə], [əː], [ɔ], [ɔː], [ɨ], [ɨː]. Another vowel form, ॵ, is sometimes used for [ɔ] (and [ɔː] is skipped altogether). These symbols are taken from ALA-LC as is and follow Wikipedia. Kashmiri consonants: च़ [t͡s], छ़ [t͡sʰ] and ज़ [z]
 - Sindhi implosives: ॻ /g̱/, ॼ /j̱/, ॾ /ḏ/ and ॿ /ḇ/
 - Sinhala nasals: ඟ /ṇ̄ga/, ඥ /ṇ̄ja/, ඬ /ṇ̄ḍa/, ඳ /ṇ̄da/ and ඹ /ṃ̄ba/
 - Sinhala long vowel ඈ and Devanagari vowel sign candra long E (U+0955), used in Avestan, are transliterated /ẹ̄/
 - The six Malayalam chillu characters represent dead consonants (without the implicit vowel). As such, they are simply transliterated without adding an a after the consonant. Hence, ൿ, ൽ, ൾ, ൻ, ൺ and ർ are respectively transliterated as /k-/, /l-/, /ll-/, /n-/, /nn-/ and /rr-/.
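
Many of the correspondences in these notes fall out of the Unicode block layout: the major Brahmic blocks are 128 code points wide and largely parallel Devanagari, so a letter can often be mapped to its Devanagari counterpart by keeping its offset within the block. The sketch below illustrates that idea only. The function name is hypothetical, and the shortcut is not exact: Kannada ೞ (U+0CDE), for example, does not sit at the parallel offset, and Sinhala does not follow this layout at all, so a real converter would need an exception table.

```python
DEVANAGARI_BASE = 0x0900

# Unicode blocks laid out in parallel with Devanagari.
PARALLEL_BLOCKS = {
    0x0980: "Bengali", 0x0A00: "Gurmukhi", 0x0A80: "Gujarati", 0x0B00: "Oriya",
    0x0B80: "Tamil", 0x0C00: "Telugu", 0x0C80: "Kannada", 0x0D00: "Malayalam",
}

def to_devanagari(ch: str) -> str:
    """Map a letter to the Devanagari code point at the same offset in its block."""
    base = ord(ch) & ~0x7F  # start of the 128-code-point block
    if base not in PARALLEL_BLOCKS:
        return ch           # leave other scripts untouched
    return chr(DEVANAGARI_BASE + (ord(ch) & 0x7F))

for ch in "ழഴఴ":  # Tamil, Malayalam and Telugu LLLA
    print(ch, "->", to_devanagari(ch))
```

Once a letter is in Devanagari, a single Devanagari-to-UTIL table can serve all of these scripts, which is the practical benefit of treating the Brahmic scripts as one family.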
 
Urdu
Symbols for Perso-Arabic characters are chosen so that they do not conflict with those of the Brahmic scripts. Urdu introduces six sounds [f, z, ʒ, q, x, ɣ] on top of Hindi (see Hindustani phonology). Note that [f, z, x, ɣ] are fricatives, just like ष [ʂ], स [s], ह [h]. Apart from these six, the IPA values in the table below are indicative only.
| Urdu | ق | ح | خ | ء | ع | غ | ط | ظ | ز | ذ | ض | ص | ث | ش | ژ | ف | و | 
| UTIL | q | ẖ | ḵẖ | ʼ | ʽ | ġ | ṯ | ẓ | z | z̄ | ż | s̄ | s̱ | ṡ | zh | f | w | 
| IPA | [q] | [ɦ] | [x] | [ʔ] | [ʕ] | [ɣ] | [tˤ] | [zˤ] | [z] | [ð] | [dˤ] | [sˤ] | [θ] | [ʃ] | [ʒ] | [f] | [w] | 
| Devanagari | क़ | ह़ | ख़ | ॽ | ॽ़ | ग़ | त़ | ज़ | ज़ | ज़ | ज़ | स़ | स़ | श | झ़ | फ़ | व़ | 
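
One property of the table is that several Urdu letters share a Devanagari spelling while keeping distinct UTIL symbols. The sketch below checks this for six rows copied from the table. The variable names are illustrative, and code points are written as escapes to avoid mixing text directions in the source.

```python
from collections import defaultdict

NUKTA = "\u093C"
JA = "\u091C" + NUKTA  # Devanagari ja + nukta
SA = "\u0938" + NUKTA  # Devanagari sa + nukta

# (Urdu letter, UTIL symbol, Devanagari spelling), copied from the table above.
ROWS = [
    ("\u0638", "\u1E93", JA),   # ARABIC LETTER ZAH  -> z with dot below
    ("\u0632", "z", JA),        # ARABIC LETTER ZAIN -> plain z
    ("\u0630", "z\u0304", JA),  # ARABIC LETTER THAL -> z with macron above
    ("\u0636", "\u017C", JA),   # ARABIC LETTER DAD  -> z with dot above
    ("\u0635", "s\u0304", SA),  # ARABIC LETTER SAD  -> s with macron above
    ("\u062B", "s\u0331", SA),  # ARABIC LETTER THEH -> s with macron below
]

by_devanagari = defaultdict(list)
for urdu, util, devanagari in ROWS:
    by_devanagari[devanagari].append(util)

for devanagari, symbols in by_devanagari.items():
    print(devanagari, "<-", ", ".join(symbols))

# Every Urdu letter keeps its own UTIL symbol, so the romanization stays reversible.
print(len({util for _, util, _ in ROWS}) == len(ROWS))
```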
Input methods
Of course, a transliteration scheme is not very useful if it cannot be typed on a computer, which is what Input Method Editors (IMEs) are for. The key sequences can be thought of as an ASCII transliteration of UTIL.
An Emacs input method can be found in indic-roman-postfix.el; it is a postfix input method (i.e., diacritics are entered after the character).
indic-util.mim is an m17n input method that can be used with many IMEs based on libm17n, such as iBus, uim and fcitx.
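
To illustrate what “postfix” means, here is a toy converter in which a punctuation key typed after a letter adds a diacritic to it. The key assignments below are hypothetical and chosen only for this example; the actual bindings in indic-roman-postfix.el and indic-util.mim may differ.

```python
import unicodedata

# Hypothetical postfix keys, NOT the bindings of the real input methods.
POSTFIX_MARKS = {
    ".": "\u0323",  # dot below
    "'": "\u0307",  # dot above
    "-": "\u0304",  # macron above
    "_": "\u0331",  # macron below
}

def postfix_to_util(keys: str) -> str:
    """Turn an ASCII key sequence into UTIL text, attaching marks to the previous letter."""
    clusters = []
    for key in keys:
        if key in POSTFIX_MARKS and clusters and clusters[-1][0].isalpha():
            clusters[-1] += POSTFIX_MARKS[key]  # diacritic follows its base letter
        else:
            clusters.append(key)
    return unicodedata.normalize("NFC", "".join(clusters))

print(postfix_to_util("kr.s.n.a"))
print(postfix_to_util("a.-"))
```

A real input method also has to deal with literal punctuation (a sentence-final period after a letter would be swallowed here), which this sketch ignores.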