Satish B. Setty

Phonetic transliteration of Arabic script to Devanagari

This article presents a way of transliterating Arabic scripts, as used in Arabic, Persian and Urdu, to the Devanagari script used for Sanskrit, Hindi, Marathi, etc.

Please send suggestions and corrections to MAIL.ADDRESS.

Table of Contents:–

  1. Motivation
  2. Existing work
  3. Consonants
  4. Vowels
  5. Oddities

1. Motivation

The foundation of Sanātana Dharma rests on the tradition of pūrva-pakṣa. It is the first step in the philosophical criticism of your rival school of thought. It involves building a deep familiarity with your opponent’s point of view before criticizing it. Rajiv Malhotra’s Being Different explains the concept in detail.

In my opinion, this dharmic tradition of dialectical debate with non-Indian religions has been lost in our current times. It’s well-known that the Vedanta schools have had debates with rival schools like Atheism, Materialism, Buddhism and Jainism. But this treatment has not been extended to Islam, Christianity, Judaism and other non-Indian religions. The various gurus, svāmis and maṭhas of Vedanta have neglected these religions.

One reason for this neglect is the inability to study their religious scriptures, because they are in non-Indic languages (basically, “not Sanskrit”). Since dharmic debates are inevitably in Sanskrit, and Sanskrit today is written in Devanagari, it is imperative to transliterate their scriptures into our Devanagari script as a first step.

One of my planned projects is to transliterate the Quran into Devanagari and other Indic scripts. I believe this opens the doors for dialectic of Islam by the dharmic schools. In this regard, I present a way of writing Arabic in Devanagari. The transliteration is phonetic and completely reversible (so far). For the sake of completeness, the scheme covers not only the Arabic alphabet but also the Persian and Urdu alphabets, which are derived from it. Going from Devanagari to other Indic scripts is pretty trivial.

2. Existing work

Obviously I’m not the first one to come up with this idea. Many European scholars from the 19th century have already set a precedent. The Urdu script is a superset of the Persian script, which itself is a superset of the Arabic script. So, by transliterating the Urdu script, the other two will be automatically covered. My work in this article is largely drawn from the Preface of Forbes’ A Dictionary, Hindustānī and English. My contribution lies in converting his transliteration tables to Unicode and updating the romanization scheme to ALA-LC.

In passing, I must also mention Gilchrist’s book, The Hindee-Roman orthoepigraphical ultimatum. It is an amazing book that goes into great depth to romanize the Devanagari and Urdu scripts. He provides tens of pages of trilingual texts (romanized Latin script, Devanagari and Urdu script), which enable one to cross-compare various orthographical elements of these scripts. He even went to the extent of inventing new Devanagari letters to represent the sounds missing in Sanskrit! Take a look:

The Devanagari letters corresponding to the five variations of z (numbered 71 to 75) in the picture were invented by him!

3. Consonants

Thanks to Unicode, we don’t have to resort to such gymnastics.

In the tables below, the style of the displayed Unicode character depends on what fonts you’ve installed (e.g. a naskh font or a nastaliq font) and what language your browser picks up (Arabic, Persian or Urdu). I recommend SIL’s Scheherazade font. Many other Quranic fonts work well for all Arabic scripts. For Devanagari, I love the Siddhanta and Chandas fonts. Pan-Unicode fonts like DejaVu Sans are the best overall.

The following table shows the transliteration for Arabic consonants in their isolated forms. The initial, medial and final forms of the Arabic consonants are irrelevant to Devanagari (since it is phonetic anyway). They are ignored by ALA-LC’s transliteration (pdf) as well.

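| Code point | Arabic | ALA-LC | Devanagari |
|---|---|---|---|
| U+0628 | ب | b | ब |
| U+0628 U+06BE | بھ | bh | भ |
| U+067E | پ | p | प |
| U+067E U+06BE | پھ | ph | फ |
| U+062A | ت | t | त |
| U+062A U+06BE | تھ | th | थ |
| U+0679 | ٹ | ṭ | ट |
| U+0679 U+06BE | ٹھ | ṭh | ठ |
| U+062B | ث | s̱ | स॒ |
| U+062C | ج | j | ज |
| U+062C U+06BE | جھ | jh | झ |
| U+0686 | چ | c | च |
| U+0686 U+06BE | چھ | ch | छ |
| U+062D | ح | ḥ | ह़ |
| U+062E | خ | ḵẖ | ख़ |
| U+062F | د | d | द |
| U+062F U+06BE | دھ | dh | ध |
| U+0688 | ڈ | ḍ | ड |
| U+0688 U+06BE | ڈھ | ḍh | ढ |
| U+0630 | ذ | ẕ | ॼ |
| U+0631 | ر | r | र |
| U+0691 | ڑ | ṛ | ड़ |
| U+0691 U+06BE | ڑھ | ṛh | ठ़ |
| U+0632 | ز | z | ज़ |
| U+0698 | ژ | zh | झ॒ |
| U+0633 | س | s | स |
| U+0634 | ش | sh | श |
| U+0635 | ص | ṣ | स़ |
| U+0636 | ض | ẓ | ॹ |
| U+0637 | ط | t̤ | त़ |
| U+0638 | ظ | z̤ | झ़ |
| U+0639 | ع | ʻ | NB |
| U+063A | غ | g̱ẖ | ग़ |
| U+0641 | ف | f | फ़ |
| U+0642 | ق | q | क़ |
| U+0643 | ك ‡ | k | क |
| U+0643 U+06BE | كھ | kh | ख |
| U+06AF | گ | g | ग |
| U+06AF U+06BE | گھ | gh | घ |
| U+0644 | ل | l | ल |
| U+0645 | م | m | म |
| U+0646 | ن | n | न |
| U+0647 | ه | h | ह |
| U+0648 | و | v | व |
| U+064A | ي ‡ | y | य |
| U+0621 | ء | ʼ | ॽ |
| U+0629 | ة | t/h | थ़ |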

NB ayn = ॽ़ (U+097D U+093C)

‡ In Persian (and hence Urdu), the two characters from the Arabic script, ك (U+0643) and ي (U+064A), are replaced respectively by ک (U+06A9) and ی (U+06CC).
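The substitution above suggests a pre-processing step: fold the Persian/Urdu letter variants onto the Arabic code points before looking anything up in the consonant table. Here is a minimal sketch of that idea. The names `VARIANT_TO_ARABIC` and `fold_variants` are my own placeholders, not part of any library.

```python
# Hypothetical helper: fold Persian/Urdu kaf and yeh onto the Arabic
# code points used in the consonant table.
VARIANT_TO_ARABIC = {
    0x06A9: 0x0643,  # ک (Persian/Urdu kaf) -> ك (Arabic kaf)
    0x06CC: 0x064A,  # ی (Persian/Urdu yeh) -> ي (Arabic yeh)
}


def fold_variants(text: str) -> str:
    """Return text with Persian/Urdu kaf and yeh replaced by the Arabic forms."""
    return text.translate(VARIANT_TO_ARABIC)


# Both variants are folded; other letters pass through unchanged.
print(fold_variants("\u06A9\u06CC") == "\u0643\u064A")
```

Note that folding throws away the information about which variant the source text used, so a reversible round trip would have to record the source language separately.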

The Persian alphabet is basically the same as the Arabic alphabet, but adds four consonants: پ (U+067E, p), چ (U+0686, c), ژ (U+0698, zh), گ (U+06AF, g). These are carried over to Urdu as well.

Urdu adds 3 retroflex consonants (ṭ, ḍ, ṛ) plus 11 aspirated consonants (e.g. भ, ख), all of which are needed for words borrowed from Sanskrit (via Hindi). Aspirated digraphs are formed by appending ھ (U+06BE: Heh Doachashmee) to the corresponding non-aspirated consonant. Most fonts will adjust the glyph display accordingly. Urdu also uses ہ (U+06C1: Heh Goal) instead of ه (U+0647: Heh).
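Because an aspirate is written as two code points, a converter has to look one character ahead before emitting a consonant. The following is a rough sketch under that assumption. It covers only a handful of consonants, and the plain/aspirated Devanagari pairs are the usual Hindi equivalents. `PAIRS` and `consonants_to_devanagari` are illustrative names.

```python
HEH_DOACHASHMEE = "\u06BE"  # appended to a consonant to mark aspiration

# Base consonant -> (plain, aspirated) Devanagari; only a few rows, for illustration.
PAIRS = {
    "\u0628": ("ब", "भ"),  # beh
    "\u067E": ("प", "फ"),  # peh
    "\u062A": ("त", "थ"),  # teh
    "\u062C": ("ज", "झ"),  # jeem
    "\u0643": ("क", "ख"),  # kaf
    "\u06AF": ("ग", "घ"),  # gaf
}


def consonants_to_devanagari(text: str) -> str:
    """Map consonants, treating consonant + U+06BE as a single aspirated letter."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PAIRS:
            plain, aspirated = PAIRS[ch]
            if i + 1 < len(text) and text[i + 1] == HEH_DOACHASHMEE:
                out.append(aspirated)
                i += 2  # consume the consonant and the aspiration mark together
                continue
            out.append(plain)
        else:
            out.append(ch)  # characters outside this small table pass through
        i += 1
    return "".join(out)


print(consonants_to_devanagari("\u0628\u06BE"))  # beh + heh doachashmee
```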

Arabic has a glottal stop, ء (U+0621: hamza). This is treated as a consonant. Thankfully, Unicode 4.1 added ॽ (U+097D: glottal stop) to the Devanagari block. Combination of hamza with vowel sounds is discussed in Vowels.

3.1 Rationale

There are 15 consonants in Perso-Arabic whose equivalents do not exist in Sanskrit. Therefore, they have to be represented by diacritics on the nearest consonant. Two diacritics, a “dot below” (U+093C: nukta) and a “line below” (U+0952: stress sign anudatta) are enough to cover the distance. Some of the nukta-consonants are already available precomposed in Unicode.

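| ALA-LC | s̱ | ḥ | ḵẖ | ẕ | ṛ | ṛh | z | zh | ṣ | ẓ | t̤ | z̤ | g̱ẖ | f | q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Devanagari | स॒ | ह़ | ख़ | ॼ | ड़ | ठ़ | ज़ | झ॒ | स़ | ॹ | त़ | झ़ | ग़ | फ़ | क़ |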

U+093C (nukta) was specifically intended by Unicode for extending the alphabet. Technically, U+0952 (anudatta) is used only in Vedic texts. However, some consonants with a “line below” are already precomposed in the Unicode Devanagari block (ॻ, ॼ, ॾ, ॿ) and are used for non-Sanskrit texts. Therefore, I don’t see any reason why we can’t use a “line below” for other consonants as well, and the anudatta accomplishes that. Perso-Arabic texts do not have the concept of stress or tone, so using the anudatta in Arabic transliteration causes no confusion whatsoever. Even though U+0952 is called Anudatta, it is also used for Svarita in the Śatapatha Brāhmaṇa, so the Unicode name (Devanagari Stress Sign Anudatta) is not totally correct anyway. I take the liberty of using the anudatta to extend the Devanagari alphabet as well.
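To make the two diacritics concrete, here is a small sketch using Python’s standard `unicodedata` module. The helper names `with_dot_below` and `with_line_below` are mine. One detail worth knowing when storing transliterated text: the precomposed nukta letters such as क़ (U+0958) are excluded from Unicode composition, so NFC normalization yields the base letter followed by the nukta.

```python
import unicodedata

NUKTA = "\u093C"     # "dot below"
ANUDATTA = "\u0952"  # "line below"


def with_dot_below(base: str) -> str:
    """Append a nukta to a Devanagari base consonant."""
    return base + NUKTA


def with_line_below(base: str) -> str:
    """Append an anudatta to a Devanagari base consonant."""
    return base + ANUDATTA


# Print the official Unicode names of the two marks.
for mark in (NUKTA, ANUDATTA):
    print(f"U+{ord(mark):04X}", unicodedata.name(mark))

# Precomposed U+0958 normalizes (NFC) to क + nukta, the same string we build by hand.
qa = with_dot_below("\u0915")
print(unicodedata.normalize("NFC", "\u0958") == qa)
```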

All the nukta characters pick themselves, except for the 5 z’s (z, ẕ, ẓ, z̤, zh) and two s’s (ṣ, s̱). In order to decide their Devanagari equivalents, I apply the principle of least surprise and letter frequency distribution. The former means that a nukta should be given preference over anudatta because most Hindi-knowing Indians won’t be surprised by seeing a nukta-consonant (क़, फ़, etc.) compared to an anudatta-consonant (ॼ, झ॒, स॒). The latter means that more frequently occurring consonants should be using the nukta; and anudatta should be used elsewhere.

There is ample literature about frequency distribution of the letters of Arabic (especially Quran) and Persian (pdf). From these sources, I collated the following letter frequency table:

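| Letter | z | ẕ | ẓ | z̤ | zh |
|---|---|---|---|---|---|
| Persian | 2.12% | 0.17% | 0.39% | 0.26% | 0.09% |
| Quran | 0.48% | 1.49% | 0.51% | 0.26% | NA |
| Arabic | 0.25% | 1% | 0.25% | 0.1% | NA |
| Dev. | ज़ | ॼ | ॹ | झ़ | झ॒ |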

‘z’ is the most frequent in Persian, and ‘ẕ’ in Arabic. Therefore, they both are assigned the least-surprising ज-variants, ज़ and ॼ respectively, both of which are precomposed. ‘ẓ’ is the second-most frequent in both Persian and the Quran. Hence, it gets another precomposed ज-variant, ॹ (Devanagari ZHA). ‘z̤’ is next in line because it is the next most frequent. Moreover, ‘zh’ occurs only in Persian, so ‘z̤’ takes precedence. Since we’ve run out of ज-variants, we must use the variants of its aspirated cousin, झ. This gives झ़ for ‘z̤’ and झ॒ for ‘zh’. The ‘zh’ sounds like /ʒ/ in IPA, as in measure, so a झ-variant is not a bad choice after all.
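The ranking argument can be reproduced from the percentages in the frequency table. This sketch assumes the columns are in the order z, ẕ, ẓ, z̤, zh (as listed in the prose) and uses only the Persian and Quran rows. The constant names and `ranked` are my own.

```python
# Romanized labels for the five z-like letters (column order is an assumption).
Z, Z_LINE, Z_DOT, Z_2DOT, ZH = "z", "ẕ", "ẓ", "z̤", "zh"

# Letter frequencies in percent, copied from the table.
FREQ = {
    "Persian": {Z: 2.12, Z_LINE: 0.17, Z_DOT: 0.39, Z_2DOT: 0.26, ZH: 0.09},
    "Quran": {Z: 0.48, Z_LINE: 1.49, Z_DOT: 0.51, Z_2DOT: 0.26},
}

# The Devanagari letters chosen in this article.
ASSIGNED = {Z: "ज़", Z_LINE: "ॼ", Z_DOT: "ॹ", Z_2DOT: "झ़", ZH: "झ॒"}


def ranked(freq: dict) -> list:
    """Return the letters sorted from most to least frequent."""
    return sorted(freq, key=freq.get, reverse=True)


for corpus, freq in FREQ.items():
    order = ranked(freq)
    print(corpus, [(letter, ASSIGNED[letter]) for letter in order])
```

The output shows a different leader in each corpus, which is why both ‘z’ and ‘ẕ’ end up with a precomposed ज-variant rather than one of them being demoted.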

A similar table for s:

| Letter | s̱ | ṣ |
|---|---|---|
| Persian | 0.12% | 0.61% |
| Quran | 0.43% | 0.63% |
| Arabic | 0.8% | 1% |
| Dev. | स॒ | स़ |

The choice here is much more straight-forward. ‘ṣ’ is more frequent than ‘s̱’ in all cases, so the assignment of “dot below” for the former and “line below” for the latter is non-controversial.

The character ة (U+0629: teh marbuta) is the letter ه (U+0647: heh) to which the dots of ت (U+062A: teh) have been added. So, it is pronounced as /t/ त or /h/ ह depending on the feminine grammatical ending. This is a nuisance for reversible transliteration. Hence, I chose /th/ थ़ with a dot below to reflect this duality. I’m not alone in this. Other transliteration schemes like the Wehr, ISO 233 and SAS also give a dedicated letter to teh marbuta instead of trying to negotiate between /h/ and /t/.
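Reversibility amounts to the forward mapping being one-to-one, which is easy to check mechanically. The sketch below uses just the five letters relevant to this paragraph. `FORWARD` and `invert` are illustrative names, and the mapping is a small excerpt rather than the full table.

```python
# Excerpt of the forward mapping (Arabic letter -> Devanagari).
FORWARD = {
    "\u062A": "त",        # teh
    "\u0647": "ह",        # heh
    "\u0629": "थ\u093C",  # teh marbuta -> थ़
    "\u0637": "त\u093C",  # tah -> त़
    "\u062D": "ह\u093C",  # hah -> ह़
}


def invert(mapping: dict) -> dict:
    """Build the reverse mapping, refusing any two inputs that share an output."""
    inverse = {}
    for src, dst in mapping.items():
        if dst in inverse:
            raise ValueError(
                f"U+{ord(src):04X} and U+{ord(inverse[dst]):04X} share one output"
            )
        inverse[dst] = src
    return inverse


INVERSE = invert(FORWARD)
print(INVERSE["थ\u093C"] == "\u0629")  # teh marbuta survives the round trip

# If teh marbuta were written as plain त, it would collide with teh.
try:
    invert({**FORWARD, "\u0629": "त"})
except ValueError as err:
    print("collision:", err)
```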

An advantage of this scheme overall is that all the required characters and diacritics are chosen from the Devanagari Unicode block itself. Theoretically, one could use other choices like “two dots below” (स᳞ using U+1CDE; स̤ using U+0324), a “ring below” (स̥ using U+0325) or a “line below” (स̱ using U+0331), but all these require the font to support other Unicode blocks (like Vedic Extensions or Combining Diacritical Marks) in addition to the Devanagari block. Therefore, it’s better to stick to nukta and anudatta.
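The block membership of each candidate diacritic can be verified from its code point. A minimal sketch, with the three block ranges taken from the Unicode standard and `block_of` as my own helper name:

```python
# Unicode block ranges (inclusive) relevant to the diacritics discussed here.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Combining Diacritical Marks": (0x0300, 0x036F),
    "Vedic Extensions": (0x1CD0, 0x1CFF),
}


def block_of(ch: str) -> str:
    """Return the name of the block containing ch, or 'other'."""
    cp = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= cp <= high:
            return name
    return "other"


# Nukta and anudatta first, then the alternatives mentioned in the text.
for mark in ("\u093C", "\u0952", "\u1CDE", "\u0324", "\u0325", "\u0331"):
    print(f"U+{ord(mark):04X}", block_of(mark))
```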

4. Vowels

Arabic generally omits short vowels and writes just a string of consonants. For example, قلب (qlb) can mean either qalb or qalab; the speaker chooses the correct one from context. However, the Quran is always written with all the vowel marks, which is called full vocalization. Hence, in this section, I discuss full vocalization. Also note that Persian and Urdu have slight variations in the choice of certain vowel signs like the alef maksura (U+0649) and waw (U+0648), which can be ignored for now.

As with Sanskrit, the vowels in Arabic have different signs depending upon whether they occur at the beginning of a word or not (i.e, आम्र but राम, not रआम). Moreover, all Arabic vowel signs, long and short, follow a consonant. Therefore, for any word beginning with a vowel itself, instead of writing the vowel alone (like in Sanskrit), the vowel diacritic is written on top of ا (U+0627: Alef). The alef is a “carrier”, meant to “carry” the vowel sign.

Arabic has three short vowels (a, i, u), three long vowels (ā, ī, ū) and two diphthongs (ay, aw). These respectively map to Devanagari’s vowels (अ, इ, उ), (आ, ई, ऊ) and (ऐ, औ).

| Initial: code point | Arabic | Devanagari | Non-initial: code point | Arabic | Devanagari |
|---|---|---|---|---|---|
| U+0627 U+064E | اَ | अ | U+064E | َ | NA[1] |
| U+0627 U+064F | اُ | उ | U+064F | ُ | ु |
| U+0627 U+0650 | اِ | इ | U+0650 | ِ | ि |
| U+0627 U+0653[2] | آ | आ | U+064E U+0627[2][3] | َا | ा |
| U+0627 U+064F U+0648 | اُو | ऊ | U+064F U+0648 | ُو | ू |
| U+0627 U+0650 U+064A | اِي | ई | U+0650 U+064A | ِي | ी |
| U+0627 U+064E U+064A | اَي | ऐ | U+064E U+064A | َي | ै |
| U+0627 U+064E U+0648 | اَو | औ | U+064E U+0648 | َو | ौ |
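The vowel table translates naturally into a longest-match lookup: try the three-code-point sequences first, then two, then one. The sketch below is only an illustration under several assumptions. It handles one fully vocalized word at a time, knows only five consonants, ignores sukun, shadda and the precomposed آ (U+0622), and treats a waw or yeh that carries its own short vowel as a consonant rather than as part of a long vowel. All names are hypothetical.

```python
FATHA, DAMMA, KASRA, MADDA = "\u064E", "\u064F", "\u0650", "\u0653"
ALEF, WAW, YEH = "\u0627", "\u0648", "\u064A"
SHORT = {FATHA, DAMMA, KASRA}

# Word-initial sequences -> independent Devanagari vowels.
INITIAL = {
    ALEF + FATHA: "अ", ALEF + DAMMA: "उ", ALEF + KASRA: "इ",
    ALEF + MADDA: "आ",
    ALEF + DAMMA + WAW: "ऊ", ALEF + KASRA + YEH: "ई",
    ALEF + FATHA + YEH: "ऐ", ALEF + FATHA + WAW: "औ",
}
# Non-initial sequences -> dependent vowel signs (fatha is the implicit a).
DEPENDENT = {
    FATHA: "", DAMMA: "\u0941", KASRA: "\u093F",
    FATHA + ALEF: "\u093E", DAMMA + WAW: "\u0942", KASRA + YEH: "\u0940",
    FATHA + YEH: "\u0948", FATHA + WAW: "\u094C",
}
CONSONANTS = {"\u0628": "ब", "\u0645": "म", "\u0646": "न", WAW: "व", YEH: "य"}


def match_vowel(word, i, table):
    """Return the longest key of table starting at word[i], or None."""
    for length in (3, 2, 1):
        seq = word[i:i + length]
        if len(seq) == length and seq in table:
            following = word[i + length:i + length + 1]
            # A waw/yeh followed by its own short vowel is a consonant.
            if seq[-1] in (WAW, YEH) and following in SHORT:
                continue
            return seq
    return None


def transliterate(word: str) -> str:
    """Transliterate one fully vocalized word (vowels and a few consonants)."""
    out, i = [], 0
    while i < len(word):
        table = INITIAL if i == 0 else DEPENDENT
        seq = match_vowel(word, i, table)
        if seq is not None:
            out.append(table[seq])
            i += len(seq)
        else:
            out.append(CONSONANTS.get(word[i], word[i]))
            i += 1
    return "".join(out)


print(transliterate("\u0628\u064E\u0627\u0628"))  # beh, fatha, alef, beh
```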

These vowel signs have names: the short vowel signs َ (U+064E), ُ (U+064F) and ِ (U+0650) are called fatha, damma and kasra respectively.

In addition to the above lookup table, Arabic uses certain simplification rules, which neatly map to Devanagari:

  1. Doubling of a consonant (C) is indicated by a ّ (U+0651: shadda). So, instead of just writing the same consonant twice (like CC), one writes CS where S = shadda. In Sanskrit, this is indicated by a virama plus the previous consonant. For example, क + ् + क = क्क. This is to avoid ambiguity because two consecutive consonants, CC, could represent two syllables with implicit a. (just like in Sanskrit कक vs. क्क). This process of consonant doubling is called gemination.
  2. When a consonant is not followed by any vowel sound, it is indicated by a ْ (U+0652: sukun). This is exactly the task of Devanagari Virama. For example, lb لْب = Lam + Sukun + Beh. In Devanagari, it would be lb ल्ब = La + Virama + Ba. (Don’t forget to append a virama when a word ends in a consonant.)
  3. The final vowel of a noun or adjective may have an n sound to indicate that it is indefinite. This is shown by writing the corresponding short vowel twice, to give ً (an), ٌ (un) and ٍ (in) which easily map to Devanagari न् , ुन् and िन् respectively. These three are called fathatan (U+064B), dammatan (U+064C) and kasratan (U+064D), the suffix -tan indicating doubling. This process is called nunation.
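The three rules above can be sketched in a few lines. This is a simplified illustration, not a complete converter: it knows only seven consonants, skips long vowels, and does not append a final virama on its own. Because the shadda may be typed before or after the vowel mark, the sketch first gathers all marks attached to a consonant and applies the shadda before anything else. All names are my own.

```python
VIRAMA = "\u094D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

CONSONANTS = {
    "\u0644": "ल", "\u0628": "ब", "\u0645": "म", "\u062F": "द",
    "\u0631": "र", "\u0643": "क", "\u062A": "त",
}
N = "न" + VIRAMA  # the bare n of nunation

# What each mark contributes after its consonant.
MARK_OUTPUT = {
    FATHA: "", DAMMA: "\u0941", KASRA: "\u093F",
    SUKUN: VIRAMA,                     # rule 2: no vowel
    FATHATAN: N,                       # rule 3: -an
    DAMMATAN: "\u0941" + N,            # rule 3: -un
    KASRATAN: "\u093F" + N,            # rule 3: -in
}
MARKS = set(MARK_OUTPUT) | {SHADDA}


def transliterate(word: str) -> str:
    """Apply the shadda, sukun and nunation rules to one vocalized word."""
    out, i = [], 0
    while i < len(word):
        ch = word[i]
        i += 1
        if ch not in CONSONANTS:
            out.append(ch)  # anything unknown passes through unchanged
            continue
        dev = CONSONANTS[ch]
        marks = []
        while i < len(word) and word[i] in MARKS:
            marks.append(word[i])
            i += 1
        out.append(dev)
        if SHADDA in marks:
            out.append(VIRAMA + dev)  # rule 1: C + virama + C
        out.extend(MARK_OUTPUT[m] for m in marks if m != SHADDA)
    return "".join(out)


print(transliterate("\u0644\u0652\u0628"))  # lam, sukun, beh
```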

Some of the above characters are available precomposed in Unicode (ex: آ U+0622 = U+0627 + U+0653).

The letter ں (U+06BA: noon ghunna) denotes nasalization (and only occurs in final position). Nasalization elsewhere is indicated by an ordinary ن. In Devanagari, all forms of nasals are denoted by candrabindu (हैं वहाँ हँसना). (Maybe the final nasal should be denoted by anusvāra instead?)

Alef with madda above, آ (U+0622 = U+0627 + U+0653), is mapped to ॽा (long ा with glottal stop ॽ) in Arabic but to a simple long आ in Perso-Urdu. Alef with hamza above, أ (U+0623 = U+0627 + U+0654), is always mapped to the implicit-a glottal stop, ॽ. Alef with hamza below, إ (U+0625 = U+0627 + U+0655), is mapped to ॽि (short ि with glottal stop ॽ).
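Since these letters exist both precomposed and as alef plus a combining mark, it helps to decompose the input first so that a single lookup covers both spellings. A minimal sketch using the standard `unicodedata` module follows. It uses the Arabic reading of alef with madda; `ALEF_FORMS` and `alef_to_devanagari` are hypothetical names.

```python
import unicodedata

GLOTTAL_STOP = "\u097D"  # ॽ

# Decomposed alef forms -> Devanagari (Arabic reading for alef + madda).
ALEF_FORMS = {
    "\u0627\u0653": GLOTTAL_STOP + "\u093E",  # alef + madda       -> ॽा
    "\u0627\u0654": GLOTTAL_STOP,             # alef + hamza above -> ॽ
    "\u0627\u0655": GLOTTAL_STOP + "\u093F",  # alef + hamza below -> ॽि
}


def alef_to_devanagari(text: str):
    """Map a precomposed or decomposed alef form; return None if it is not one."""
    decomposed = unicodedata.normalize("NFD", text)
    return ALEF_FORMS.get(decomposed)


# The precomposed and decomposed spellings give the same result.
print(alef_to_devanagari("\u0623") == alef_to_devanagari("\u0627\u0654"))
```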

The sequence ال (U+0627 U+0644: alef lam) is always mapped to अल्.

5. Oddities

Some weird rules concerning alef, hamza, etc.

  1. The sukūn may also be used to help represent a diphthong. A fatḥah followed by the letter 〈ﻱ〉 (yā’) with a sukūn over it indicates the diphthong ay (IPA /aj/). A fatḥah followed by the letter 〈ﻭ〉 (wāw) with a sukūn indicates /aw/.
  2. Nunation can end in ى (U+0649: alef maksura).
  3. Waw with hamza above. Null?

  1. Devanagari has no sign for implicit a. So in this case, U+064E can be replaced by an empty string in software.  ↩

  2. Long ā is also represented by another, redundant form: اى (U+0627 U+0649) as the initial vowel and a corresponding َى (U+064E U+0649) for the non-initial vowel form. The character ى (U+0649) looks like “yeh without two dots” and is called Alef Maksura. This form occurs only at the end of a word.  ↩

  3. Long vowel ā in non-initial position is commonly represented by a standalone Alef (U+0627) in Persian-Urdu, instead of a Fatha + Alef (U+064E U+0627) like in Arabic.  ↩