UTIL: Unified Transliteration of Indic Languages

:NOTE: Download PDF of this article. You need good Unicode fonts to read the article. I recommend the Noto font family from Google, which you can install with apt-get. IPA symbols are written within square brackets, like [ʂ], and transliterated symbols within slashes, like /ṭ/.

UTIL is a romanization scheme for Indic languages. It is designed as a pan-Indian transliteration scheme. It covers more than 20 languages: Bengali, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Lepcha, Limbu, Manipuri (Meitei), Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Sinhala, Tamil, Telugu, Urdu and probably many more.

So, why yet another scheme?

IAST is insufficient. It serves Sanskrit and Pali but is incomplete for pretty much everything else (e.g. Bengali, Gujarati, etc.).
ISO 15919 is also insufficient. It ignores Kashmiri and Sindhi, which are integral Indian languages. It also lacks symbols for newly assigned Unicode code points (e.g. ॹ or ॺ). In addition, कृष्ण /Kṛṣṇa/ is typographically more consistent than /Kr̥ṣṇa/.
ALA-LC is designed as a set of single-language models that ignore the inherent similarity of the Brahmic scripts. This leads to inconsistencies. For example, Tamil ழ /ḻ/ and Kannada ೞ /l̤/ correspond to the same character and sound (“retroflex approximant”) and yet have different representations. Conversely, the same symbol /ṣ/ represents both Hindi ष [ʂ] and Urdu ص [sˤ], even though they are completely different sounds.
Other schemes such as Hunterian or Gretil are as bad as the above, or sometimes even worse.

So, how is UTIL better?

Covers the entire character set of ISO 15919 plus more (Kashmiri, Sindhi)
Long vowels always have macron above (ā, ē, …)
Aspirated consonants always have ‘h’ as second letter (kh, gh, …)
Minimum number of diacritical marks:
- Only four diacritics are used: “dot above”, “dot below”, “macron above”, “macron below” (or their combination).
- Only three diacritics are needed for Sanskrit, instead of IAST’s five.
Prefers precomposed characters from the Unicode repertoire, but does not require them.
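
The diacritic budget above can be checked mechanically. The following is a minimal Python sketch, not part of UTIL itself: it decomposes a transliterated string and reports any combining mark other than the four listed above, and it shows NFC normalization as one way to prefer precomposed characters. The function names are made up for illustration, and the check is meant for letters only, since accent and anunāsika marks use further combining characters.

```python
import unicodedata

# The four combining marks allowed on letters, per the list above.
ALLOWED_MARKS = {
    "\u0307",  # COMBINING DOT ABOVE
    "\u0323",  # COMBINING DOT BELOW
    "\u0304",  # COMBINING MACRON (macron above)
    "\u0331",  # COMBINING MACRON BELOW
}

def unexpected_marks(text):
    """Return the combining marks in `text` that are not among the four allowed ones."""
    # NFD splits precomposed letters such as U+1E5B into base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    return {ch for ch in decomposed
            if unicodedata.combining(ch) and ch not in ALLOWED_MARKS}

def precompose(text):
    """Use precomposed code points wherever Unicode provides them (NFC)."""
    return unicodedata.normalize("NFC", text)
```

With this sketch, the ISO 15919 spelling /Kr̥ṣṇa/ is flagged because of its ring below (U+0325), while /Kṛṣṇa/ passes with an empty set.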

Vowels

Primary vowels and diphthongs:

अ	आ	इ	ई	उ	ऊ	ऋ	ॠ	ऌ	ॡ	ऎ	ए	ऐ	ऒ	ओ	औ
a	ā	i	ī	u	ū	ṛ	ṝ	ḷ	ḹ	e	ē	ai	o	ō	au

Additional ones, all have a dot below:

ॳ	ॴ	ऍ	ॲ	ऑ	ॶ	ॷ
ạ	ạ̄	ẹ	ọ	ọ̄	ụ	ụ̄

Consonants

Consonants with their Sanskrit names:

	Plosives				Nasal	Implosives	Fricatives		Vibrants		Approximants
	स्पर्श				नासिक	विस्पर्श	ऊष्मन्		द्रव		अन्तस्थ
कण्ठ्य Velar	क k	ख kh	ग g	घ gh	ङ ṅ	ॻ g̱	ख़ ḵẖ	ग़ ġ			ह h
तालव्य Palatal	च c	छ ch	ज j	झ jh	ञ n̄	ॼ j̱	श ṡ	झ़ zh			य y	य़ ẏ
मूर्धन्य Retroflex	ट ṭ	ठ ṭh	ड ḍ	ढ ḍh	ण ṇ		ष ṣ		ड़ ṙ	ढ़ ṙh	ळ ḻ	ऴ ẕ
दन्त्य Dental	त t	थ th	द d	ध dh	न n		स़ s̱				ल l
वर्त्स्य Alveolar	च़ ċ	छ़ ċh			ऩ ṉ	ॾ ḏ	स s	ज़ z	र r	ऱ ṟ
ओष्ठ्य Labial	प p	फ ph	ब b	भ bh	म m	ॿ ḇ	फ़ f				व v	व़ w

Affricate glide ॺ (‘JJYA’) is transcribed /j̄/.
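
To make the tables concrete, here is a minimal Python sketch of a Devanagari to UTIL converter. It is an illustration under stated assumptions, not a reference implementation: it covers only an excerpt of the tables, it assumes that dependent vowel signs take the same value as the corresponding independent vowels, and it ignores schwa deletion and vowel hiatus. The names CONSONANTS, VOWEL_SIGNS and transliterate are invented for this example.

```python
import unicodedata

# Excerpt of the consonant table above (not the full repertoire).
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "ष": "ṣ", "स": "s", "ह": "h",
}
INDEPENDENT_VOWELS = {"अ": "a", "आ": "ā", "ऋ": "ṛ", "ऐ": "ai"}
# Assumption: a vowel sign is written like its independent vowel.
VOWEL_SIGNS = {"\u093E": "ā", "\u0943": "ṛ", "\u0948": "ai"}
OTHER_SIGNS = {"\u0902": "ṃ", "\u0903": "ḥ"}  # anusvāra, visarga
VIRAMA = "\u094D"

def transliterate(text):
    out = []
    pending_a = False  # True while the last consonant still carries its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # the sign replaces the inherent 'a'
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False            # the virama removes the inherent 'a'
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEPENDENT_VOWELS.get(ch, OTHER_SIGNS.get(ch, ch)))
    if pending_a:
        out.append("a")
    return unicodedata.normalize("NFC", "".join(out))
```

Under these assumptions, transliterate("कृष्ण") gives kṛṣṇa and transliterate("हंस") gives haṃsa. Characters outside the excerpt are passed through unchanged.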

Other symbols

Anusvāra: ṃ	Anunāsika: ̐	Avagraha: ’
Visarga: ḥ	Jihvāmūlīya: x̣	Upadhmānīya: ẋ
Vedic Udātta: ́	Svarita (independent): ̀	Anudātta: ̱
Arabic hamza ء: ʼ	Arabic ain ع: ʽ
Rising tone: ˊ	Falling tone: ˋ	Neutral tone: ˙

Udātta and svarita use the combining acute and grave accents respectively, whereas hamza and ain use the non-combining modifier letters U+02BC and U+02BD respectively. Tone modifiers are used in Maithili, Dogri and other Pahari languages.
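
The distinction between combining accents and non-combining modifier letters can be inspected with Python's standard unicodedata module. This is only an illustrative sketch; the dictionary keys and the helper name is_combining are invented here.

```python
import unicodedata

MARKS = {
    "combining acute": "\u0301",
    "combining grave": "\u0300",
    "hamza (U+02BC)": "\u02BC",
    "ain (U+02BD)": "\u02BD",
}

def is_combining(ch):
    """True if `ch` attaches to the preceding character (non-zero combining class)."""
    return unicodedata.combining(ch) != 0

for label, ch in MARKS.items():
    # Print the code point, its official Unicode name and whether it combines.
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch), is_combining(ch))
```

A practical consequence is that the accents merge with the preceding vowel under normalization, while hamza and ain always remain separate characters of general category Lm (modifier letter).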

General Notes

Anunāsika is denoted by a combining candrabindu. Note the difference between हंस (swan) /haṃsa/ and हँस (laugh) /ha̐s/. In a digraph, the diacritic goes only on the second letter. Example: हैँ /hai̐/
A colon is used to denote vowel hiatus or resolve ambiguity. Example: बई /ba:i/ (not /bai/)
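
The two notes above can be sketched in code. The following Python fragment is a hypothetical reader for transliterated text, written only to illustrate how the colon removes ambiguity and where the candrabindu sits. It knows a single digraph, "ai", and does not handle consonant digraphs such as "kh". The names UNIT and units are invented for this example.

```python
import re

CANDRABINDU = "\u0310"  # COMBINING CANDRABINDU, marks anunāsika

# One unit is either the digraph "ai" or any single character other than
# the colon, optionally followed by a candrabindu. The colon itself is
# never part of a unit, so it acts purely as a separator.
UNIT = re.compile("(?:ai|[^:])" + CANDRABINDU + "?")

def units(translit):
    """Split a transliteration into units, treating ':' as a hiatus marker."""
    return UNIT.findall(translit)
```

Here units("bai") yields the two units b and ai, while units("ba:i") yields b, a and i. In units("hai" + CANDRABINDU) the candrabindu stays attached to the whole digraph because it follows its second letter.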

Script Notes

ॠ /ṝ/, ऌ /ḷ/ and ॡ /ḹ/ are used only in Sanskrit.
ऎ /e/ = short ए in Southern scripts (எ, എ, ಎ, ఎ)
ऒ /o/ = short ओ in Southern scripts (ஒ, ഒ, ಒ, ఒ)
ऍ /ẹ/ = Gujarati ઍ, Sinhala ඇ, pronounced [æ] as in “bat”
ऑ /ọ̄/ = Gujarati ઑ, pronounced [ɔː] as in “ball”
ड़ /ṙa/ = Bengali ড়, Punjabi ੜ, Oriya ଡ଼ (RRA, “retroflex flap”)
ढ़ /ṙha/ = Bengali ঢ়, Oriya ଢ଼ (RHA, “aspirated retroflex flap”)
ळ /ḻa/ = used in Marathi, Tamil ள, Malayalam ള, Kannada ಳ, Telugu ళ (LLA, “retroflex lateral approximant”)
ऴ /ẕa/ = Tamil ழ, Malayalam ഴ, Kannada ೞ, Telugu ఴ (LLLA, “retroflex approximant” = zha)
ऩ /ṉa/ = Tamil ன , Kannada ನ಼, Malayalam ഩ (NNNA, “alveolar n”)
ऱ /ṟa/ = Tamil ற, Malayalam റ, Kannada ಱ, Telugu ఱ (RRA, “alveolar r”)
र् /ṟ/ = repha in Marathi
य़ /ẏa/ = য in Bengali and Oriya, while য /y/ = য়
व़ /wa/ = Urdu و, Assamese ৱ, Oriya ୱ
ख़ /ḵẖa/ = used in Urdu, Punjabi ਖ਼
ग़ /ġa/ = used in Urdu, Punjabi ਗ਼
च़ /ċa/ = used in Kashmiri, Telugu ౘ
ज़ /za/ = Urdu ز, Gurmukhi ਜ਼, Bengali জ়, Kannada ಜ಼, Telugu ౙ [d͡z]
ॹ, झ़ /zh/ = Urdu ژ, Gujarati ૹ, Avestan uses ॹ
Kashmiri vowels ạ, ạ̄, ọ, ọ̄, ụ, ụ̄ are pronounced [ə], [əː], [ɔ], [ɔː], [ɨ], [ɨː]. Another vowel form, ॵ, is sometimes used for [ɔ] (and [ɔː] is skipped altogether). These symbols are taken as-is from ALA-LC and follow Wikipedia. Kashmiri consonants: च़ [t͡s], छ़ [t͡sʰ] and ज़ [z]
Sindhi implosives: ॻ /g̱/, ॼ /j̱/, ॾ /ḏ/ and ॿ /ḇ/
Sinhala prenasalized consonants: ඟ /ṇ̄ga/, ඥ /ṇ̄ja/, ඬ /ṇ̄ḍa/, ඳ /ṇ̄da/ and ඹ /ṃ̄ba/
Sinhala long vowel ඈ and Devanagari vowel sign candra long E (U+0955), used in Avestan, are transliterated /ẹ̄/
The six Malayalam chillu characters represent dead consonants (without the implicit vowel). As such, they are simply transliterated without adding an /a/ after the consonant. Hence, ൿ, ൽ, ൾ, ൻ, ൺ and ർ are respectively transliterated as /k-/, /l-/, /ll-/, /n-/, /nn-/ and /rr-/.

Urdu

Perso-Arabic characters are chosen so that they do not conflict with the Brahmic scripts. Urdu introduces six sounds [f, z, ʒ, q, x, ɣ] on top of Hindi (see Hindustani phonology). Note that [f, z, x, ɣ] are fricatives, just like ष [ʂ], स [s], ह [h]. Apart from these sounds, the IPA values in the table below are indicative only.

Urdu:
UTIL: ẖ ḵẖ ṯ ẓ z̄ s̄ s̱ ṡ
IPA: [q] [ɦ] [x] [ʔ] [ʕ] [ɣ] [tˤ] [zˤ] [z] [ð] [dˤ] [sˤ] [θ] [ʃ] [ʒ] [f] [w]
Devanagari: क़ ह़ ख़ ॽ ॽ़ ग़ त़ ज़ स़ श झ़ फ़ व़

Input methods for IMEs

Of course, a transliteration scheme is not of much use if it cannot be entered into a computer, which is what Input Method Editors (IMEs) are for. The key sequences they accept can be thought of as an ASCII transliteration of UTIL.

An Emacs input method can be found in indic-roman-postfix.el. It is a postfix input method (i.e., diacritics are entered after the character).

indic-util.mim is an m17n input method that can be used with the many IMEs based on libm17n, such as iBus, uim and fcitx.
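
To illustrate what "postfix" means in practice, here is a small Python sketch that turns ASCII key sequences into UTIL letters. The key assignments in POSTFIX_KEYS are hypothetical and chosen only for this example; they are not the bindings defined in indic-roman-postfix.el or indic-util.mim, which should be consulted directly.

```python
import unicodedata

# HYPOTHETICAL postfix keys, invented for this sketch only.
POSTFIX_KEYS = {
    "-": "\u0304",  # macron above
    "_": "\u0331",  # macron below
    ".": "\u0323",  # dot below
    "'": "\u0307",  # dot above
}

def postfix_to_util(ascii_text):
    """Apply each postfix key to the letter typed just before it."""
    out = []  # one string per output character: base letter plus its marks
    for ch in ascii_text:
        if ch in POSTFIX_KEYS and out and out[-1][0].isalpha():
            out[-1] += POSTFIX_KEYS[ch]   # attach the mark to the previous letter
        else:
            out.append(ch)                # ordinary character, or a key with no letter before it
    # Prefer precomposed characters where Unicode has them.
    return unicodedata.normalize("NFC", "".join(out))
```

With these made-up keys, typing kr.s.n.a produces kṛṣṇa and k_h_ produces ḵẖ. A real input method also needs an escape mechanism, since in this sketch a full stop typed after a letter is always consumed as a dot below.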