@Proyag
Last active August 9, 2021 13:26
sei2uni: Converting bangtex files to Unicode

sei2uni

Created by: Proyag Pal and Palash Baran Pal

First published: February 2021


Purpose

The software package bangtex was developed for writing Bengali/Assamese in TeX/LaTeX format. Insofar as a user is interested in producing only TeX/LaTeX output in the form of a ps or pdf file, bangtex is sufficient and this program, sei2uni, is irrelevant.

The utility of this program is in making Unicode Bengali text from the files created for bangtex. This is useful because these days, many publishers want a Unicode file along with the pdf file, which helps their typesetting and pagemaking. Once the .tex file is produced, written in the bangtex format, this program produces a .txt file with Unicode Bengali in it.

After the development of the original bangtex, several supporting programs were written as front-ends to make input easier and faster. With these programs, one first writes a preliminary file and then runs the program on it to produce the .tex file that TeX/LaTeX can process. One such program is seicor, developed by Somendra Mohan Bhattacharjee. A preliminary file written for seicor can also be passed directly to the present program, sei2uni, to produce the Unicode file. In other words, it is not even necessary to create the .tex file if only the Unicode file is of interest.

Thus, sei2uni produces a Unicode .txt file from the kind of file used with seicor (extension _sei.tex). There are two ways to obtain a suitable input file:

  1. Create an almost-phonetic bangtex file which uses commands to put certain vowel symbols before the consonants with which they are joined, like \*b*i\*d*esh for printing out বিদেশ in the output. No matter how this file is produced, run sei2uni on it to obtain the Unicode file.

  2. Create a file that can be transformed to the .tex file by using seicor. On this file, one can apply sei2uni to obtain the Unicode file.
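As a rough illustration of how the almost-phonetic bangtex markup in option 1 reduces to a plain phonetic string, here is a minimal sketch modeled on the first few substitutions in sei2uni.pl. The helper name strip_prefix_markup is hypothetical, and the sketch leaves out the protection of the LaTeX line break \\* that the real script performs first.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of sei2uni.pl): strip bangtex's
# prefix-vowel markup so that the text reads phonetically.
# Assumption: the input contains no \\* line-break commands.
sub strip_prefix_markup {
    my ($text) = @_;
    $text =~ s/\\\*//g;                     # drop the leading \* marker
    $text =~ s/\*\{(oi|aa|eou|ou)\}/$1/g;   # *{oi} -> oi, *{aa} -> aa, ...
    $text =~ s/\*([ei])/$1/g;               # *e -> e, *i -> i
    return $text;
}

print strip_prefix_markup('\*b*i\*d*esh'), "\n";   # prints "bidesh"
```

The resulting string, bidesh, is what the later Unicode substitutions turn into বিদেশ.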

Description

To run this script, you need perl to be installed. It has been tested with perl v5.16.3, v5.26.1 and v5.30.0.

Usage:

perl sei2uni.pl [options] input_file

Alternatively, if one makes sei2uni.pl an executable file, then one can use

sei2uni.pl [options] input_file

giving the proper path to the file sei2uni.pl, or after adding its location to the PATH environment variable.

Optional arguments:

  • -k, --keep-rm : Keeps the \rm tags and their associated braces in the output .txt file. The default is to remove these tags and braces.

  • -o, --output-file : Name of output file. Defaults to *.txt for an input file named *_sei.tex, otherwise to uni_out.txt.

  • -p, --placeholder : A marker string used internally to stand in for text that must not be converted. The default is ^#. If ^# occurs in the input, set this option to any string or character that does not appear there.
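The default naming rule for the output file can be sketched as a small function. This is only an illustration of the rule described for --output-file; the function name default_output_name is an assumption and does not appear in sei2uni.pl.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the documented default for --output-file.
sub default_output_name {
    my ($input) = @_;
    # xyz_sei.tex -> xyz.txt
    return "$1.txt" if $input =~ /^(.*)_sei\.tex$/s;
    # Any other input name falls back to a fixed file name
    return 'uni_out.txt';
}

print default_output_name('poem_sei.tex'), "\n";   # prints "poem.txt"
print default_output_name('poem.tex'), "\n";       # prints "uni_out.txt"
```

Passing -o on the command line overrides this default in either case.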

The output of the program will be a .txt file, whose name will be determined by the default, or by the user's specification, as described above.
This .txt file will contain all Bengali text from the sei file in Unicode characters. It will not make any change in the following parts of the sei file:

  • Any TeX/LaTeX command starting with a backslash. The unconverted region continues until the program finds a blank space or a line break in the sei file.

  • Any text intended to appear in the Roman font, announced by \rm. These announcements must appear in one of the following formats in the sei file:

    • \rm{ABCD}
    • {\rm{ABCD}}
    • {\rm ABCD PQRS}

    where the capital letters stand for arbitrary text.

  • Everything in math mode, provided math mode is opened and closed by the $ sign.
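To see which spans of a line would be left unconverted, the rules above can be approximated with a single regular expression. The sketch below is modeled on the pattern used in sei2uni.pl but only lists the spans instead of replacing them; the helper name protected_spans is hypothetical.

```perl
use strict;
use warnings;

# Simplified pattern for the spans left untouched: math mode, the three
# \rm forms, and any other backslash command up to the next whitespace.
my $protected = qr/\$.*?\$|\{\\rm\{.*?\}\}|\\rm\{.*?\}|\{\\rm.*?\}|\\\S+/s;

# Hypothetical helper: return every protected span, in order of appearance.
sub protected_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ /($protected)/g) {
        push @spans, $1;
    }
    return @spans;
}

my @spans = protected_spans('\title {pRem} $x^2$ {\rm{future}}-Er {\rm past all hope,}');
print "$_\n" for @spans;
```

For this sample line the sketch reports four spans: the \title command, the math expression, and the two \rm groups. Everything else on the line would be treated as Bengali text.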

Example

Here is an example of a short sei file and the .txt file produced after applying sei2uni.pl on it.

Input bangtex file:

\documentclass{barticle}

\begin{document}\bng

%<ei> \title {pRemtot/tWo}

\author {dWijen/dRolal ray}

\begin{verse} tareI bole pRem--- \\* Jokhon thake na {\rm{future}}-Er cin/ta, thakenako {\rm{shame}}---\\* tareI bole pRem.\\ Jokhon bud/dhi shud/dhi leap; \\ Jokhon %</ei> {\rm past all surgery} %<ei> Aar Jokhon %</ei> {\rm past all hope,} \\ %<ei> tare bhin/no jiibon Theke Jokhon bhari {\rm{tame}};---\\* tareI bole pRem.

\end{verse}

%</ei>

\end{document}

Output Unicode file:

\documentclass{barticle}

\begin{document}\bng

\title {প্রেমতত্ত্ব}

\author {দ্বিজেন্দ্রলাল রায়}

\begin{verse} তারেই বলে প্রেম— \\* যখন থাকে না future-এর চিন্তা, থাকেনাক shame—\\* তারেই বলে প্রেম।\\ যখন বুদ্ধি শুদ্ধি লোপ; \\ যখন past all surgery আর যখন past all hope, \\ তারে ভিন্ন জীবন ঠেকে যখন ভারি tame;—\\* তারেই বলে প্রেম।

\end{verse}

\end{document}

Warnings

  1. There is no unique way of writing the ASCII sei file. For example, to produce the Bengali text ওই, one can write OoI in the input file, so that the O and the I do not join in a ligature to give ঐ in the output. In bangtex the same effect can be achieved by typing O{I} or {O}I, but sei2uni.pl handles only the first alternative, OoI. With the other alternatives, the braces will be visible in the output.

  2. The sei2uni.pl converter only converts the text; it does not interpret TeX/LaTeX commands. So, for example, if there is a command for creating a table, the .txt file will not come out with a table. The same applies to any other formatting construct, such as a figure or an equation.
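The first warning follows from the order in which the substitutions are applied. The sketch below uses a tiny, illustrative subset of the substitution table, applied in the same relative order as in the script; the helper name convert_subset and the @rules list are assumptions made for this example.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Illustrative subset of the table, in the script's relative order:
# the digraph OI first, then the silent "o", then single letters.
my @rules = (
    [ 'OI', 'ঐ' ],
    [ 'o',  ''  ],
    [ 'I',  'ই' ],
    [ 'O',  'ও' ],
);

# Hypothetical helper: apply each rule to the whole string, one after another.
sub convert_subset {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($from, $to) = @$rule;
        $text =~ s/\Q$from\E/$to/g;
    }
    return $text;
}

print convert_subset('OoI'), "\n";    # ওই
print convert_subset('OI'), "\n";     # ঐ
print convert_subset('O{I}'), "\n";   # ও{ই}
```

Because OI is tested before the o is dropped, OoI survives as two separate vowels, while the braces in O{I} are simply carried through to the output.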

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Getopt::Long "GetOptions";
# Parse options
my $placeholder = '^#';
my $textout = '';
my $keep_rm;
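GetOptions ("output-file|o=s" => \$textout,
            "placeholder|p=s" => \$placeholder,
            "keep-rm|k!" => \$keep_rm)
    or die("Usage: sei2uni.pl [-k] [-o output-file] [-p placeholder] input_file\n");
# The input file is the one required positional argument
my $textin = $ARGV[0];
defined $textin
    or die("Usage: sei2uni.pl [-k] [-o output-file] [-p placeholder] input_file\n");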
if ($textout eq "") {
    if ($textin =~ /_sei\.tex$/) {
        # For input file xyz_sei.tex, output file defaults to xyz.txt
        ($textout = $textin) =~ s/_sei\.tex$/\.txt/;
    }
    else {
        # Otherwise output file defaults to uni_out.txt
        $textout = "uni_out.txt";
    }
}
# Open files
open(my $fh_in, '<:encoding(UTF-8)', $textin)
    or die "Could not open file '$textin' for reading: $!";
open(my $fh_out, '>:encoding(UTF-8)', $textout)
    or die "Could not open file '$textout' for writing: $!";
# Read file contents into one variable for multiline matching
my $file_contents = do { local $/; <$fh_in> };
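# Protect the LaTeX line break \\* with the placeholder, strip the bangtex
# prefix marker \*, then restore the line breaks
$file_contents =~ s/\\\\\*/$placeholder/g;
$file_contents =~ s/\\\*//g;
$file_contents =~ s/\Q$placeholder\E/\\\\\*/g;
# Reduce the starred vowel forms (*{oi}, *e, ...) to plain phonetic vowels
$file_contents =~ s/\*\{oi\}/oi/g;
$file_contents =~ s/\*\{aa\}/aa/g;
$file_contents =~ s/\*\{eou\}/eou/g;
$file_contents =~ s/\*\{ou\}/ou/g;
$file_contents =~ s/\*e/e/g;
$file_contents =~ s/\*i/i/g;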
# Swap out text that must not be converted (math mode, \rm text, and
# other TeX/LaTeX commands) for the placeholder, saving each span in order
my @matches;
my $match;
while ($file_contents =~ /((\$.*?\$)|(\{\\rm\{(.*?)\}\})|(\\rm\{.*?\})|(\{\\rm.*?\})|(\\\S+))/gs) {
    $match = $&;
    unless ($keep_rm) {
        # Strip the \rm tag and its braces, keeping only the enclosed text
        $match =~ s/\{\\rm\{(.*?)\}\}/$1/s;
        $match =~ s/\\rm\{(.*?)\}/$1/s;
        $match =~ s/\{\\rm(.*?)\}/$1/s;
        $match =~ s/^\s+//;
        chomp $match;
    }
    push @matches, $match;
    $file_contents =~ s/\Q$&\E/$placeholder/;
}
# Checking the number of matches and placeholders
my $count = () = $file_contents =~ /\Q$placeholder\E/g;
if ($count != @matches) {
    # scalar(@matches) is the number of saved spans ($#matches would be the last index)
    die "Error: The number of placeholders ($count) and replacements (" . scalar(@matches) . ") didn't match\n";
}
# Unicode substitutions
$file_contents =~ s/\R?\h*\%\<ei\>\h*$//gm;
$file_contents =~ s/\R?\h*\%\<\/ei\>\h*$//gm;
$file_contents =~ s/NNG/ং/g;
$file_contents =~ s/NN/ঁ/g;
$file_contents =~ s/NG/ঙ/g;
$file_contents =~ s/NJ/ঞ/g;
$file_contents =~ s/kK/ক্ষ/g;
$file_contents =~ s/g\/Y/জ্ঞ/g;
$file_contents =~ s/t\/\//ৎ/g;
$file_contents =~ s/n\/kh/ঙ্খ/g;
$file_contents =~ s/n\/gh/ঙ্ঘ/g;
$file_contents =~ s/n\/ch/ঞ্ছ/g;
$file_contents =~ s/n\/jh/ঞ্ঝ/g;
$file_contents =~ s/n\/k/ঙ্ক/g;
$file_contents =~ s/n\/g/ঙ্গ/g;
$file_contents =~ s/n\/c/ঞ্চ/g;
$file_contents =~ s/n\/j/ঞ্জ/g;
$file_contents =~ s/n\/H/হ্ন/g;
$file_contents =~ s/N\/H/হ্ণ/g;
$file_contents =~ s/kh/খ/g;
$file_contents =~ s/gh/ঘ/g;
$file_contents =~ s/ch/ছ/g;
$file_contents =~ s/jh/ঝ/g;
$file_contents =~ s/Th/ঠ/g;
$file_contents =~ s/Dh/ঢ/g;
$file_contents =~ s/th/থ/g;
$file_contents =~ s/dh/ধ/g;
$file_contents =~ s/ph/ফ/g;
$file_contents =~ s/bh/ভ/g;
$file_contents =~ s/sh/শ/g;
$file_contents =~ s/Sh/ষ/g;
$file_contents =~ s/rhh/ঢ়/g;
$file_contents =~ s/rh/ড়/g;
$file_contents =~ s/h/ঃ/g;
$file_contents =~ s/\//্/g;
$file_contents =~ s/oi/ৈ/g;
$file_contents =~ s/OI/ঐ/g;
$file_contents =~ s/eou/ৌ/g;
$file_contents =~ s/ou/ৌ/g;
$file_contents =~ s/OU/ঔ/g;
$file_contents =~ s/uu/ূ/g;
$file_contents =~ s/u/ু/g;
$file_contents =~ s/ea/ো/g;
$file_contents =~ s/e/ে/g;
$file_contents =~ s/aa/ে/g;
$file_contents =~ s/Aa/আ/g;
$file_contents =~ s/a/া/g;
$file_contents =~ s/o//g;
$file_contents =~ s/AA/এ/g;
$file_contents =~ s/A/অ/g;
$file_contents =~ s/b/ব/g;
$file_contents =~ s/c/চ/g;
$file_contents =~ s/d/দ/g;
$file_contents =~ s/D/ড/g;
$file_contents =~ s/E/এ/g;
$file_contents =~ s/g/গ/g;
$file_contents =~ s/H/হ/g;
$file_contents =~ s/ii/ী/g;
$file_contents =~ s/i/ি/g;
$file_contents =~ s/II/ঈ/g;
$file_contents =~ s/I/ই/g;
$file_contents =~ s/j/জ/g;
$file_contents =~ s/J/য/g;
$file_contents =~ s/k/ক/g;
$file_contents =~ s/l/ল/g;
$file_contents =~ s/L/্ল/g;
$file_contents =~ s/m/ম/g;
$file_contents =~ s/M/্ম/g;
$file_contents =~ s/n/ন/g;
$file_contents =~ s/N/ণ/g;
$file_contents =~ s/O/ও/g;
$file_contents =~ s/p/প/g;
$file_contents =~ s/rR/ৃ/g;
$file_contents =~ s/r/র/g;
$file_contents =~ s/RR/ঋ/g;
$file_contents =~ s/R/্র/g;
$file_contents =~ s/s/স/g;
$file_contents =~ s/t/ত/g;
$file_contents =~ s/T/ট/g;
$file_contents =~ s/UU/ঊ/g;
$file_contents =~ s/U/উ/g;
$file_contents =~ s/W/্ব/g;
$file_contents =~ s/y/য়/g;
$file_contents =~ s/Y/্য/g;
$file_contents =~ s/ঁা/াঁ/g;
$file_contents =~ s/ঁ্যা/্যাঁ/g;
$file_contents =~ s/ঁি/িঁ/g;
$file_contents =~ s/ঁী/ীঁ/g;
$file_contents =~ s/ঁু/ুঁ/g;
$file_contents =~ s/ঁূ/ূঁ/g;
$file_contents =~ s/ঁৃ/ৃঁ/g;
$file_contents =~ s/ঁে/েঁ/g;
$file_contents =~ s/ঁৈ/ৈঁ/g;
$file_contents =~ s/ঁো/োঁ/g;
$file_contents =~ s/ঁৌ/ৌঁ/g;
$file_contents =~ s/\./।/g;
$file_contents =~ s/।।।/\.\.\./g;
$file_contents =~ s/---/—/g;
$file_contents =~ s/--/–/g;
$file_contents =~ s/\`\`/❝/g;
$file_contents =~ s/\`/❛/g;
$file_contents =~ s/\'\'/❞/g;
$file_contents =~ s/\'/❜/g;
$file_contents =~ s/\"/❞❯/g;
$file_contents =~ s/0/০/g;
$file_contents =~ s/1/১/g;
$file_contents =~ s/2/২/g;
$file_contents =~ s/3/৩/g;
$file_contents =~ s/4/৪/g;
$file_contents =~ s/5/৫/g;
$file_contents =~ s/6/৬/g;
$file_contents =~ s/7/৭/g;
$file_contents =~ s/8/৮/g;
$file_contents =~ s/9/৯/g;
# Replace the placeholders with the unaffected text
while ($file_contents =~ /\Q$placeholder\E/g) {
    $file_contents =~ s/\Q$&\E/$matches[0]/;
    shift @matches;
}
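# Write out and make sure the data reached the file
print $fh_out $file_contents;
close($fh_out) or die "Could not close file '$textout': $!";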