@shreevatsa
Last active September 5, 2021 18:04
Python script to get Devanagari input from a file, transliterate it, call `devnag` or `skt` preprocessor on it, and write out relevant output to new file. Supporting code for https://tex.stackexchange.com/questions/296266/old-sanskrit-fonts-and-unicode-input/358887#358887

This preprocesses Devanagari text for use with \dn or \skt.

Not very well tested, but seems to work.

Intended usage

It was originally written for this answer and meant to be used with the macros there.

Inconvenient usage

It's inconvenient to use by itself, but if you only need Devanagari text for some one-off use case, you could use it if you wish.

Example usage for dn

Create a file, say myfile-dev-for-dn, containing (only) Devanagari input. For example, the file may contain:

धर्मक्षेत्रे कार्त्स्न्यम् विद्भिः

Then run

python get-dn.py dn myfile-dev-for-dn

This will create a file called myfile-dev-for-dn.devnagout containing:

{\dn Dm\0\322w\?/\? kA(-\306wy\0\qq{m} EvE\389w,
}

So you can put that into your .tex file.

Example usage for skt

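Similarly, run the script with skt instead of dn as the first argument. The script then calls the skt preprocessor as ./skt, so that executable must be in the current directory.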
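
As a rough illustration of what choosing dn or skt changes on the transliteration side: in the script below, each table entry is a list of spellings, and the dn path takes the first one while the skt path takes the last. A minimal Python 3 sketch with a two-entry excerpt of the vowel-sign table (the helper name `spelling` is made up for this example and is not part of the script):

```python
# Excerpt of the vowel-sign table from the script; each value lists the
# spelling(s) available for one code point.
vowel_signs = {
    0x093E: ['aa'],          # a single spelling, used for both dn and skt
    0x0944: ['.R', '.r.r'],  # first entry is used for dn, last for skt
}


def spelling(codepoint, ext):
    """Hypothetical helper: pick the spelling the script would use for ext."""
    entries = vowel_signs[codepoint]
    return entries[0] if ext == 'dn' else entries[-1]


print(spelling(0x0944, 'dn'))   # .R
print(spelling(0x0944, 'skt'))  # .r.r
```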

from __future__ import unicode_literals
import os
import re
import subprocess
import sys

consonants = {
    0x0915: ['k'],
    0x0916: ['kh'],
    0x0917: ['g'],
    0x0918: ['gh'],
    0x0919: ['"n'],
    0x091A: ['c'],
    0x091B: ['ch'],
    0x091C: ['j'],
    0x091D: ['jh'],
    0x091E: ['~n'],
    0x091F: ['.t'],
    0x0920: ['.th'],
    0x0921: ['.d'],
    0x0922: ['.dh'],
    0x0923: ['.n'],
    0x0924: ['t'],
    0x0925: ['th'],
    0x0926: ['d'],
    0x0927: ['dh'],
    0x0928: ['n'],
    0x092A: ['p'],
    0x092B: ['ph'],
    0x092C: ['b'],
    0x092D: ['bh'],
    0x092E: ['m'],
    0x092F: ['y'],
    0x0930: ['r'],
    0x0932: ['l'],
    0x0933: ['L'],
    0x0935: ['v'],
    0x0936: ['"s'],
    0x0937: ['.s'],
    0x0938: ['s'],
    0x0939: ['h'],
    0x0958: ['q'],
    0x0959: ['.kh'],
    0x095A: ['.g'],
    0x095B: ['z'],
    0x095C: ['R'],
    0x095D: ['Rh'],
    0x095E: ['f'],
}

vowel_signs = {
    0x093E: ['aa'],
    0x093F: ['i'],
    0x0940: ['ii'],
    0x0941: ['u'],
    0x0942: ['uu'],
    0x0943: ['.r'],
    0x0944: ['.R', '.r.r'],
    0x0947: ['e'],
    0x0948: ['ai'],
    0x0949: ['~o'],
    0x094B: ['o'],
    0x094C: ['au'],
    0x0962: ['.l'],
    0x0963: ['.ll', '.l.l'],
}

vowels = {
    0x0905: ['a'],
    0x0906: ['aa'],
    0x0907: ['i'],
    0x0908: ['ii'],
    0x0909: ['u'],
    0x090A: ['uu'],
    0x090B: ['.r'],
    0x090C: ['.l'],
    0x090F: ['e'],
    0x0910: ['ai'],
    0x0913: ['o'],
    0x0914: ['au'],
    0x0960: ['.R'],
    0x0961: ['.L'],
    0x0972: ['~a'],
}

other = {
    # 0x002E: ['..'],
    0x0901: ['/'],
    0x0902: ['.m'],
    0x0903: ['.h'],
    0x093D: ['.a'],
    0x094D: ['&'],
    0x0950: ['.o'],
    0x0964: ['|'],
    0x0965: ['||'],
    0x0966: ['0'],
    0x0967: ['1'],
    0x0968: ['2'],
    0x0969: ['3'],
    0x096A: ['4'],
    0x096B: ['5'],
    0x096C: ['6'],
    0x096D: ['7'],
    0x096E: ['8'],
    0x096F: ['9'],
    0x0970: ['@'],
    0x0971: ['#'],
}

re_consonant = '|'.join(unichr(n) for n in consonants)
re_vowel_sign = '|'.join(unichr(n) for n in vowel_signs)
re_vowel = '|'.join(unichr(n) for n in vowels)
re_other = '|'.join(unichr(n) for n in other)
re_virama = unichr(0x094D)
re_a = vowels[0x0905][0] # 'a'


def velthuis(devanagari):
    text = devanagari
    # Consonant followed by a vowel sign: the sign replaces the inherent 'a'.
    text = re.sub('(%s)(%s)' % (re_consonant, re_vowel_sign),
                  lambda match: consonants[ord(match.group(1))][0] + vowel_signs[ord(match.group(2))][0],
                  text)
    # Consonant followed by virama: bare consonant with no vowel.
    text = re.sub('(%s)(%s)' % (re_consonant, re_virama),
                  lambda match: consonants[ord(match.group(1))][0],
                  text)
    # Any remaining consonant carries the inherent 'a'.
    text = re.sub('(%s)' % re_consonant,
                  lambda match: consonants[ord(match.group(1))][0] + re_a,
                  text)
    # Independent vowels.
    text = re.sub('(%s)' % re_vowel,
                  lambda match: vowels[ord(match.group(1))][0],
                  text)
    # Everything else in the `other` table (anusvara, visarga, digits, ...).
    text = re.sub('(%s)' % re_other,
                  lambda match: other[ord(match.group(1))][0],
                  text)
    return text


def wikner(devanagari):
    # Same passes as velthuis(), but taking the last spelling of each entry.
    text = devanagari
    text = re.sub('(%s)(%s)' % (re_consonant, re_vowel_sign),
                  lambda match: consonants[ord(match.group(1))][-1] + vowel_signs[ord(match.group(2))][-1],
                  text)
    text = re.sub('(%s)(%s)' % (re_consonant, re_virama),
                  lambda match: consonants[ord(match.group(1))][-1],
                  text)
    text = re.sub('(%s)' % re_consonant,
                  lambda match: consonants[ord(match.group(1))][-1] + re_a,
                  text)
    text = re.sub('(%s)' % re_vowel,
                  lambda match: vowels[ord(match.group(1))][-1],
                  text)
    text = re.sub('(%s)' % re_other,
                  lambda match: other[ord(match.group(1))][-1],
                  text)
    return text

random_filename = 'lwfzal3XBeV8H10I8f4n'


def get_preprocessed(filename, ext):
    preprocessor = {
        'dn': 'devnag',
        'skt': './skt',
    }
    assert ext in preprocessor.keys(), ext
    text = open(filename).read().decode('utf-8')
    transliterated = velthuis(text) if ext == 'dn' else wikner(text)
    # Write the transliterated text to a temporary input file for the preprocessor.
    infile = '%s-%s.%s' % (random_filename, ext, ext)
    open(infile, 'w').write(r'{\%s %s}' % (ext, transliterated))
    p = subprocess.Popen([preprocessor[ext], infile],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         close_fds=True)
    # communicate() waits for the preprocessor to exit, so returncode is set
    # (reading p.returncode without waiting would always give None).
    out, err = p.communicate()
    ret = p.returncode
    if err or ret:
        print 'input: <%s>' % text.encode('utf-8')
        print 'transliterated: <%s>' % transliterated
        print 'stdout: <%s>' % out
        print 'stderr: <%s>' % err
        print 'returned: <%s>' % ret
        raise ValueError
    # The preprocessor writes its result next to the input, with a .tex extension.
    outfile = '%s-%s.tex' % (random_filename, ext)
    translation = open(outfile).read()
    os.remove(outfile)
    os.remove(infile)
    # Strip the wrapper so that only the preprocessed text itself is returned.
    prefix = r'\def\DevnagVersion{2.17}{\dn ' if ext == 'dn' else r'{\skt '
    assert translation.startswith(prefix), translation
    assert translation.endswith('}')
    translation = translation[len(prefix):-1]
    return translation


if __name__ == '__main__':
    ext = sys.argv[1]
    filename = sys.argv[2]
    out = get_preprocessed(filename, ext)
    open(filename + '.devnagout', 'w').write(r'{\%s %s}' % (ext, out))
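
The order of the substitutions in velthuis() and wikner() matters: consonant plus vowel sign first, then consonant plus virama, and only then the remaining bare consonants, which get the inherent a. Below is a minimal Python 3 sketch of those three passes on a small excerpt of the tables. The script itself targets Python 2; the name `velthuis_sketch` and the reduced tables are only for illustration.

```python
import re

# Small excerpt of the script's tables, keyed by code point.
CONSONANTS = {0x0915: 'k', 0x0927: 'dh', 0x092E: 'm', 0x0930: 'r'}
VOWEL_SIGNS = {0x093E: 'aa', 0x093F: 'i', 0x0947: 'e'}
VIRAMA = chr(0x094D)

RE_CONSONANT = '|'.join(chr(n) for n in CONSONANTS)
RE_VOWEL_SIGN = '|'.join(chr(n) for n in VOWEL_SIGNS)


def velthuis_sketch(text):
    """Illustrative only: the first three passes of the script's velthuis()."""
    # Pass 1: consonant + vowel sign, so no inherent 'a' is added.
    text = re.sub('(%s)(%s)' % (RE_CONSONANT, RE_VOWEL_SIGN),
                  lambda m: CONSONANTS[ord(m.group(1))] + VOWEL_SIGNS[ord(m.group(2))],
                  text)
    # Pass 2: consonant + virama gives the bare consonant.
    text = re.sub('(%s)%s' % (RE_CONSONANT, VIRAMA),
                  lambda m: CONSONANTS[ord(m.group(1))],
                  text)
    # Pass 3: whatever consonants are left carry the inherent 'a'.
    text = re.sub('(%s)' % RE_CONSONANT,
                  lambda m: CONSONANTS[ord(m.group(1))] + 'a',
                  text)
    return text


# dha + ra + virama + ma
print(velthuis_sketch('\u0927\u0930\u094d\u092e'))  # dharma
```

If pass 3 ran first, every consonant would already have been turned into Latin text ending in a, and the vowel signs and viramas that follow would have nothing left to attach to.
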
@shreevatsa
Author

@kanginthaya This was just example code for this answer: https://tex.stackexchange.com/questions/296266/old-sanskrit-fonts-and-unicode-input/358887#358887 — at the end of which some example code is given. Have added a link to that answer at the top of this gist; let me know if that helps. Cheers,

@VP007-py

VP007-py commented Jul 22, 2020

Hey @shreevatsa

I tried to convert Hindi text to Velthuis as mentioned here (to add it to a LaTeX file).

Upon running the file with the text मेरे माता और पिता को समर्पित as python2 convert.py dn file

I get

Traceback (most recent call last):
  File "convert.py", line 191, in <module>
    out = get_preprocessed(filename, ext)
  File "convert.py", line 169, in get_preprocessed
    close_fds=True)
  File "/usr/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory


@shreevatsa
Author

@kanginthaya

Is it possible to add a readme with a minimum working example? Thanks!

How to use this was indeed less clear than I thought. I've added a README now. It may not be of much use now, one year later, sorry!


@PVinay737

Upon running the file […] I get OSError: [Errno 2] No such file or directory

I get this error if the file does not actually exist. E.g. if I run python get-dn.py dn foo and the file foo does not exist.
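
In the traceback quoted above, the exception is raised inside subprocess.Popen, which is also where Python reports Errno 2 when the program to run (devnag or ./skt) cannot be found. So it may be worth checking for the preprocessor as well as for the input file. A minimal Python 3 sketch of such a check follows. The helper name `missing_prerequisites` is hypothetical and not part of the script, and shutil.which needs Python 3.3 or later.

```python
import os
import shutil

# Same mapping as in get_preprocessed().
PREPROCESSOR = {'dn': 'devnag', 'skt': './skt'}


def missing_prerequisites(filename, ext):
    """Hypothetical helper: list what is missing before running the script."""
    problems = []
    if not os.path.isfile(filename):
        problems.append('input file not found: %s' % filename)
    # shutil.which searches PATH for a bare name such as 'devnag', and checks
    # the given path directly for a name with a directory part such as './skt'.
    if shutil.which(PREPROCESSOR[ext]) is None:
        problems.append('preprocessor not found: %s' % PREPROCESSOR[ext])
    return problems


for problem in missing_prerequisites('foo', 'dn'):
    print(problem)
```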

@ritwikmishra

ritwikmishra commented Sep 5, 2021

When converting a long text

सफदर हाशमी एक कम्युनिस्ट नाटककार, कलाकार, निर्देशक, गीतकार और कलाविद थे। उन्हे नुक्कड़ नाटक के साथ उनके जुड़ाव के लिए जाना जाता है। भारत के राजनैतिक थिएटर में आज भी वे एक महत्वपूर्ण स्थान रखते हैं। सफदर जन नाट्य मंच और दिल्ली में स्टूडेंट्स फेडरेशन ऑफ इंडिया (एसएफआई) के स्थापक-सदस्य थे। जन नाट्य मंच की नींव १९७३ में रखी गई थी, जनम ने इप्टा से अलग हटकर आकार लिया था। सफदर की जनवरी १९८९ में साहिबाबाद में एक नुक्कड़ नाटक 'हल्ला बोल' खेलते हुए हत्या कर दी गई थी। 

I get this error:

$ python2 gen-dn.py dn file.txt 
Traceback (most recent call last):
  File "gen-dn.py", line 191, in <module>
    out = get_preprocessed(filename, ext)
  File "gen-dn.py", line 165, in get_preprocessed
    open(infile, 'w').write(r'{\%s %s}' % (ext, transliterated))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u093c' in position 124: ordinal not in range(128)

UPDATE: I found the issue. The Velthuis system of transliteration does not have any provision for diacritics, so a commonly used sign, the nuqta, has no representation in this system. Hence characters like ड़, ढ़, ख़ were causing problems. Unicode normalization does not work for them either (NFD works fine but NFC fails). So I just replaced those combining sequences (ka + nuqta, \u0915 + \u093c) with their precomposed Unicode characters (ka with nuqta, \u0958). And it worked fine.

Here is the gist: https://gist.github.com/ritwikmishra/9f8d6de45aff8fbe959d4260269d9eeb
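
A minimal Python 3 sketch of the replacement described above. It is not the code from the linked gist, and the names `PRECOMPOSED` and `compose_nuqta` are made up for this example. It maps each base consonant followed by the nuqta (U+093C) to the precomposed letter in the range U+0958 to U+095F, which is what the script's consonants table expects.

```python
import unicodedata

NUQTA = '\u093c'

# Base consonant -> precomposed letter with nuqta.
PRECOMPOSED = {
    '\u0915': '\u0958',  # ka   -> qa
    '\u0916': '\u0959',  # kha  -> khha
    '\u0917': '\u095a',  # ga   -> ghha
    '\u091c': '\u095b',  # ja   -> za
    '\u0921': '\u095c',  # dda  -> dddha
    '\u0922': '\u095d',  # ddha -> rha
    '\u092b': '\u095e',  # pha  -> fa
    '\u092f': '\u095f',  # ya   -> yya (not in the script's consonants table)
}


def compose_nuqta(text):
    """Replace each base + nuqta pair with its precomposed letter."""
    for base, composed in PRECOMPOSED.items():
        text = text.replace(base + NUQTA, composed)
    return text


# NFC cannot do this job: these letters are excluded from composition,
# so normalizing the precomposed letter gives back the two-character sequence.
print(unicodedata.normalize('NFC', '\u095c') == '\u0921\u093c')  # True
print(compose_nuqta('\u0921\u093c') == '\u095c')                 # True
```

The function should be applied to the input text before it is passed to velthuis() or wikner(). Any nuqta that follows a consonant outside this table is left in place and would still need separate handling.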
