EIFIS – Extended IAST for Indic Scripts
Introduction
IAST, the International Alphabet of Sanskrit Transliteration, was adopted as early as 1894 for transliterating Sanskrit into the Latin script. Today, Sanskrit is written in the Devanagari script, which has more characters than Sanskrit needs because it is also used to write many other Indian languages, from Bihari to Kashmiri.
This document proposes backwards-compatible extensions to IAST to tabulate mappings not just for Sanskrit, but for all Indic scripts, including those for Dravidian languages and Southeast Asian languages. I’ve called this scheme EIFIS – Extended IAST for Indic Scripts.
A reference implementation for transliterating between Indic and Latin, written in Python, is provided on GitHub under an open source license. Also provided is an Emacs keyboard input method for the same.
To view the transliteration tables, you’ll need Unicode fonts like Siddhanta and GNU FreeSans/FreeMono.
Objectives
- To provide mappings for all the characters assigned in the Unicode 6.2 Devanagari block.
- To provide additional mappings for characters unique to Dravidian languages.
- To provide additional mappings unique to Sinhalese (Sri Lankan), Pali, and Perso-Arabic.
- Every Latin character used in a mapping must occupy only one Unicode codepoint, meaning only precomposed (normalized) Unicode characters are used, even for characters with diacritics.
It is the last point that sets EIFIS apart from schemes like ISO 15919. For example, vocalic R in Sanskrit is mapped to / r̥ / (r with combining ring below, U+0072 + U+0325) in ISO 15919. IAST’s / ṛ / (r with dot below, U+1E5B), on the other hand, takes up only one codepoint, which simplifies lookup tables because the combining diacritics needn’t be stored at all. This reduces the number of codepoints in a text document (and its size in UTF-16; in UTF-8 the saving depends on the letter) and also eases software development. Other limitations of ISO 15919 compared to EIFIS are:
- Codepoints newly assigned in recent Unicode releases, such as those for Kashmiri vowels and Sindhi implosives, have no mappings at all.
- It provides no mappings for languages like Thai or Tibetan, whose scripts are also derived from Brahmic scripts and bear a lot of similarity to those used for Sanskrit.
- The 7-bit ASCII fallback is based on prefix notation. This is probably fine for input methods, but for displaying text it is not very readable for native English readers. Compare these:
  ,r;si.h k,r.s.na;m pankajama;m (ISO 15919, prefix)
  r.s.ih. kr.s.n.am. pankajamam. (EIFIS, suffix)
Of course, EIFIS allows prefix notation for those who need it, see for example, the Emacs input method.
EIFIS aims to provide coherent and consistent mappings across all the aforementioned scripts.
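The single-codepoint objective can be checked mechanically. The following is a minimal sketch using only Python’s standard `unicodedata` module; the helper name `is_single_codepoint` is my own choice for illustration and is not taken from the reference implementation.

```python
import unicodedata


def is_single_codepoint(letter: str) -> bool:
    """Return True if the letter is exactly one codepoint after NFC normalization."""
    return len(unicodedata.normalize("NFC", letter)) == 1


# Vocalic R written as base letter + combining mark in both schemes.
iso_vocalic_r = "r\u0325"   # r + COMBINING RING BELOW (ISO 15919)
iast_vocalic_r = "r\u0323"  # r + COMBINING DOT BELOW (IAST / EIFIS)

for label, text in [("ISO 15919", iso_vocalic_r), ("IAST", iast_vocalic_r)]:
    nfc = unicodedata.normalize("NFC", text)
    codepoints = [f"U+{ord(c):04X}" for c in nfc]
    # Show the codepoints left after composition and the encoded sizes in bytes.
    print(label, codepoints,
          "utf-8:", len(nfc.encode("utf-8")),
          "utf-16:", len(nfc.encode("utf-16-le")))
```

NFC composes r + dot below into the single codepoint U+1E5B, whereas r + ring below has no precomposed form and stays as two codepoints. Printing the encoded lengths alongside shows where the saving occurs for a given letter.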
Transliteration Table — Part 1
The first table shows the mapping between Devanagari and Latin (and vice versa). If you cannot view the table correctly, download the PDF version of Table 1 [89 KiB].
Notes
- All the Latin characters in the above table are available precomposed in Unicode, thus satisfying objective #4.
- / ॕ / (U+0955, Devanagari vowel sign candra long E) is used while writing Avestan in Devanagari. In EIFIS, it is transliterated as / ệ /.
- ISO 15919 specifies only / m̐ / (U+006D, U+0310). This is equivalent to / ḿ / in EIFIS. In addition, EIFIS allows / ĺ / and / ń / to stand for L with candrabindu and N with candrabindu respectively. These are sometimes used for Upanishadic verses.
- / ड़ / (U+095C, Devanagari letter DDDHA) and / ढ़ / (U+095D, Devanagari letter RHA) are transliterated as / ṙ / and / ṙh / to be consistent with Bengali (see below).
- / ೱ / (jihvāmūlīya) and / ೲ / (upadhmānīya) are transliterated as / ẖ / and / ḫ / respectively.
- Generally, the end of a sentence and the end of a verse are indicated by a daṇḍa (U+0964) and a double daṇḍa (U+0965) respectively in Devanagari. Both are transliterated as a full stop (U+002E) in the Latin alphabet. However, where they need to be differentiated, / | / (U+007C, vertical line) and / ‖ / (U+2016, double vertical line) can be used.
- Although not explicitly mentioned, note the mappings for Kashmiri vowels (aw, ö, ȫ, ü, ǖ) and Sindhi implosives (ǧ, ǰ, ḇ, ḏ).
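The daṇḍa handling described in the notes above can be sketched as follows. This is only an illustration built on Python’s `str.translate`; the function name `transliterate_dandas` and its `distinguish` flag are assumptions of mine, not part of the reference implementation.

```python
DANDA = "\u0964"         # DEVANAGARI DANDA (end of sentence)
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA (end of verse)

# Default behaviour: both marks collapse to a full stop (U+002E).
COLLAPSED = str.maketrans({DANDA: ".", DOUBLE_DANDA: "."})

# Alternative: keep them distinct as U+007C and U+2016.
DISTINCT = str.maketrans({DANDA: "|", DOUBLE_DANDA: "\u2016"})


def transliterate_dandas(text: str, distinguish: bool = False) -> str:
    """Replace danda marks in text; leave every other character untouched."""
    return text.translate(DISTINCT if distinguish else COLLAPSED)
```

Note that collapsing both marks to a full stop is lossy: the original distinction can only be recovered on the way back to Devanagari if the distinct forms were used.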
Transliteration Table — Part 2
The second table shows the mappings of those characters which are unique to certain scripts. You can also download the PDF version [103 KiB].
Notes
- As before, all the Latin characters in the above table are available precomposed in Unicode, thus satisfying objective #4.
- The Gurmukhi characters / ੲ / (iṙi) and / ੳ / (uṙa) are the bases for the vowels /i/ and /u/ respectively. They are meaningful only in combination and are never used on their own.
- Gurmukhi / ੱ / (aḍḍak) doubles the following consonant which can be transliterated as / kk /, / pp / etc. in the Latin alphabet.
- Gurmukhi sacred symbol / ੴ / (ōṅkār) can be transliterated as ōṅkār itself.
- Gurmukhi / u / and / uu / take bindi in their initial forms and ṭippi when used after a consonant. All other short vowels take ṭippi and all other long vowels take bindi.
- Telugu / ౘ / (TSA) and / ౙ / (DZA) are transliterated as / č / and / ž / respectively.
- Telugu candrabindu (U+0C01) is sometimes used as a half-nasal. In such a case, it is to be transliterated as / ň /.
- The six Malayalam chillu characters represent dead consonants (without implicit vowel). As such, they are simply transliterated without adding an a next to the consonant. Hence, ൿ, ൽ, ൾ, ൻ, ൺ and ർ are respectively transliterated as k-, l-, ll-, n-, nn- and rr-.
Sinhalese
Perso-Arabic
Transliteration Table — Part 3
Javanese and Balinese
Thai and Lao
Tibetan
Vedic accents
Rationale
- IAST uses five diacritics: dot above (ṅ), dot below (ṭ, ḍ, ṛ, ṣ), macron above (ā, ī, ū), tilde above (ñ) and acute accent above (ś). EIFIS extends these diacritics to other characters (ġ, ṙ, ẏ, ō, ū, ḵ, ṯ, ṉ, ḿ, ĺ, etc.: see tables above). In addition, two diacritics are introduced: caron above (ǎ, ǧ, ǰ, ě, etc.) and diaeresis above (ö, ȫ, ü, ǖ only).
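To see which diacritic a given Latin letter carries, its canonical decomposition can be inspected. The snippet below is a small sketch using Python’s standard `unicodedata` module; `diacritics_of` is a name I made up for illustration.

```python
import unicodedata


def diacritics_of(letter: str) -> list:
    """Return the Unicode names of the combining marks in the letter's NFD form."""
    decomposed = unicodedata.normalize("NFD", letter)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]


# The five IAST diacritics, followed by the two introduced by EIFIS.
samples = ["\u1e45", "\u1e6d", "\u0101", "\u00f1", "\u015b",  # ṅ ṭ ā ñ ś
           "\u01e7", "\u00f6"]                                  # ǧ ö

for letter in samples:
    print(letter, diacritics_of(letter))
```

Letters such as / ȫ / and / ǖ / decompose into a base letter plus two marks (diaeresis and macron), even though each is still a single precomposed codepoint.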