Proyag/sei2uni.md

## sei2uni.md

      
    sei2uni

Created by: Proyag Pal and Palash Baran Pal

First published: February 2021


Purpose

The software package bangtex
was developed for writing
Bengali/Assamese in TeX/LaTeX format. Insofar as a user is
interested in producing only TeX/LaTeX output in the form of
a ps or pdf file, bangtex is sufficient and this program,
sei2uni, is irrelevant.
The utility of this program is in making Unicode Bengali text
from the files created for bangtex. This is useful because
these days, many publishers want a Unicode file along with the
pdf file, which helps their typesetting and pagemaking.
Once the .tex file is produced, written in the bangtex
format, this program produces a .txt file with Unicode Bengali in it.
After the original bangtex was developed, some supporting
programs were written as front-ends to make the input
process easier and faster.  With these programs, one first
writes a preliminary file and then runs the program on it
to produce the .tex file that can be processed by TeX/LaTeX.
One such program is seicor,
developed by Somendra Mohan Bhattacharjee.
If one creates a preliminary file for seicor, one can
also apply the present program, sei2uni, directly to it to
produce the Unicode file.  In other words, in this case it is not
even necessary to create the .tex file if only the Unicode file
is of interest.
Thus, sei2uni produces a Unicode .txt file either from a bangtex
file or from a seicor file (extension _sei.tex), which gives
two options:


Create an almost-phonetic bangtex file which uses commands
to put certain vowel symbols before the consonants with which they
are joined, like \*b*i\*d*esh for printing out বিদেশ in the output.
No matter how this file is produced, run sei2uni on it to obtain
the Unicode file.


Create a file that can be transformed to the .tex file by using
seicor.  On this file, one can apply sei2uni to obtain the Unicode file.
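To illustrate the first option, here is a minimal sketch of how the almost-phonetic input \*b*i\*d*esh can be reduced to plain phonetic text and then mapped to Unicode. The helper name convert_word is hypothetical, and only the handful of substitution rules needed for this one word are included; the full script applies a much longer table.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: converts one bangtex word using a small subset
# of the substitution rules, just enough for the word "bidesh".
sub convert_word {
    my ($text) = @_;
    $text =~ s/\\\*//g;        # drop the "\*" written before the consonant
    $text =~ s/\*([ei])/$1/g;  # drop the "*" in front of the vowel itself
    $text =~ s/sh/শ/g;         # digraph first, so "s" and "h" are not split
    $text =~ s/e/ে/g;          # dependent vowel signs
    $text =~ s/b/ব/g;
    $text =~ s/d/দ/g;
    $text =~ s/i/ি/g;
    return $text;
}

print convert_word('\*b*i\*d*esh'), "\n";   # prints বিদেশ
```

Once the markers are stripped, the word is simply bidesh, so a file written without them converts in the same way.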


Description

To run this script, Perl must be installed. It has been tested
with perl v5.16.3, v5.26.1 and v5.30.0.
Usage:

perl sei2uni.pl [options] input_file
Alternatively, if one makes sei2uni.pl an executable
file, then one can use
sei2uni.pl [options] input_file
using the proper path to the file sei2uni.pl, or
by including its location in the list of default paths.
Optional arguments:


-k, --keep-rm : Keeps the \rm
tags and their associated braces in the output .txt file.
The default is to remove these tags and braces.


-o, --output-file : Name of output file.
Defaults to *.txt for an input file named *_sei.tex,
otherwise to uni_out.txt.


-p, --placeholder : Marker string used internally to stand in
for text that must not be converted. The default is ^#; set it to any
string or character that does not appear in the input.


The output of the program will be a .txt file,
whose name will be determined by the default, or by the user's
specification, as described above.
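The default naming rule for the output file can be sketched as follows. The helper name default_output_name is hypothetical and exists only for this illustration; it assumes no -o option was given.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented default for -o:
# xyz_sei.tex becomes xyz.txt, anything else becomes uni_out.txt.
sub default_output_name {
    my ($input) = @_;
    if ($input =~ /^(.*)_sei\.tex$/s) {
        return "$1.txt";
    }
    return 'uni_out.txt';
}

print default_output_name('poem_sei.tex'), "\n";   # prints poem.txt
print default_output_name('poem.tex'), "\n";       # prints uni_out.txt
```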

This .txt file will contain all Bengali text from the sei file in
Unicode characters. The following parts of the sei file are
not converted:


Any TeX/LaTeX command starting with a backslash. The
protected region extends until the program finds a blank
space or a line break in the sei file.


Any text intended to appear in the Roman font, announced
by \rm. These announcements must appear in one of
the following formats in the sei file:

\rm{ABCD}
{\rm{ABCD}}
{\rm ABCD PQRS}

where the capital letters stand for arbitrary text.


Everything in math mode, provided math mode is
opened and closed by the $ sign.
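The protection of these regions can be pictured as a hide, convert, restore cycle around a placeholder string. The following is a simplified sketch of that idea, not the script's exact code: the helper name convert_protected is hypothetical, the conversion step is reduced to two stand-in rules, and the \rm tags are kept in the output (as with -k).

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: hide protected spans, convert the rest, restore.
sub convert_protected {
    my ($text, $placeholder) = @_;
    my @saved;

    # Math mode, the three \rm forms, then any other backslash command
    # up to the next whitespace.
    my $protected = qr/(?:\$.*?\$)|(?:\{\\rm\{.*?\}\})|(?:\\rm\{.*?\})|(?:\{\\rm.*?\})|(?:\\\S+)/s;

    # Remember each protected span and leave the placeholder in its place.
    $text =~ s/($protected)/push @saved, $1; $placeholder/ge;

    # Stand-in for the real substitution table: only "k" and "a".
    $text =~ s/k/ক/g;
    $text =~ s/a/া/g;

    # Put the saved spans back in their original order.
    $text =~ s/\Q$placeholder\E/shift @saved/ge;
    return $text;
}

print convert_protected('ka {\rm{ka}} $a+k$ ka', '^#'), "\n";
# prints: কা {\rm{ka}} $a+k$ কা
```

The placeholder must not occur in the input, which is why the -p option exists.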


Example

Here is an example of a short sei file and the .txt file produced
after applying sei2uni.pl on it.

The input bangtex file is listed first, followed by the output Unicode file.


\documentclass{barticle}
\begin{document}\bng
%<ei>
\title {pRemtot/tWo}
\author {dWijen/dRolal ray}
\begin{verse}
tareI bole pRem--- \\*
Jokhon thake na {\rm{future}}-Er cin/ta, thakenako {\rm{shame}}---\\*
tareI bole pRem.\\
Jokhon bud/dhi shud/dhi leap; \\
Jokhon
%</ei>
{\rm past all surgery}
%<ei>
Aar Jokhon
%</ei>
{\rm past all hope,} \\
%<ei>
tare bhin/no jiibon Theke Jokhon bhari {\rm{tame}};---\\*
tareI bole pRem.
\end{verse}
%</ei>
\end{document}


\documentclass{barticle}
\begin{document}\bng
\title {প্রেমতত্ত্ব}
\author {দ্বিজেন্দ্রলাল রায়}
\begin{verse}
তারেই বলে প্রেম— \\*
যখন থাকে না future-এর চিন্তা, থাকেনাক shame—\\*
তারেই বলে প্রেম।\\
যখন বুদ্ধি শুদ্ধি লোপ; \\
যখন
past all surgery
আর যখন
past all hope, \\
তারে ভিন্ন জীবন ঠেকে যখন ভারি tame;—\\*
তারেই বলে প্রেম।
\end{verse}
\end{document}


Warnings


There is no unique way of writing the ASCII sei file. For example,
if one wants to produce the Bengali text ওই , one can
use OoI in the input file so that the O
and the I do not join in a ligature to give ঐ in the
output. But the same effect can be achieved by
typing O{I} or {O}I.
sei2uni.pl handles only the first
alternative, OoI. With the other alternatives, the braces
will be visible in the output.
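A minimal sketch of why only OoI works: a lone o produces no output character, so it separates O and I without leaving a trace, whereas braces are passed through as they are. The helper name convert_oi is hypothetical and contains only the rules relevant to these three letters, applied in the same relative order as in the script.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: only the rules that matter for O, o and I.
sub convert_oi {
    my ($text) = @_;
    $text =~ s/OI/ঐ/g;   # adjacent O and I become the single vowel
    $text =~ s/o//g;     # a lone "o" produces no output character
    $text =~ s/I/ই/g;
    $text =~ s/O/ও/g;
    return $text;
}

print convert_oi($_), "\n" for ('OI', 'OoI', 'O{I}', '{O}I');
# prints, one per line: ঐ  ওই  ও{ই}  {ও}ই
```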


The sei2uni.pl converter only converts text; it does not
interpret TeX/LaTeX commands. So, for
example, if there is a command for creating a table, the
.txt file will not contain a table. The same
applies to any formatting command, such as figure or equation.


## sei2uni.pl
#!/usr/bin/env perl

use strict;
use warnings;
use utf8;
use Getopt::Long "GetOptions";

# Parse options
my $placeholder = '^#';
my $textout = '';
my $keep_rm;
GetOptions ("output-file|o=s" => \$textout,
            "placeholder|p=s" => \$placeholder,
            "keep-rm|k!"     => \$keep_rm)
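  or die("Usage: sei2uni.pl [-k] [-o output-file] [-p placeholder] input_file\n");

# The input file is the first remaining argument and is required
my $textin = $ARGV[0];
defined $textin
  or die("Usage: sei2uni.pl [-k] [-o output-file] [-p placeholder] input_file\n");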
if ($textout eq "") {
  if ($textin =~ /_sei\.tex$/) {
    # For input file xyz_sei.tex, output file defaults to xyz.txt
    ($textout = $textin) =~ s/_sei\.tex$/\.txt/;
  }
  else {
    # Otherwise output file defaults to uni_out.txt
    $textout = "uni_out.txt";
  }
}

# Open files
open(my $fh_in, '<:encoding(UTF-8)', $textin)
  or die "Could not open file '$textin' for reading: $!";
open(my $fh_out, '>:encoding(UTF-8)', $textout)
  or die "Could not open file '$textout' for writing: $!";

# Read file contents into one variable for multiline matching
my $file_contents = do { local $/; <$fh_in> };

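# Temporarily hide the LaTeX line break "\\*" so that stripping the
# bangtex prefix "\*" in the next step does not damage it
$file_contents =~ s/\\\\\*/$placeholder/g;
$file_contents =~ s/\\\*//g;
$file_contents =~ s/\Q$placeholder\E/\\\\\*/g;
# Strip the "*" (and braces) that mark prefixed vowel signs
$file_contents =~ s/\*\{oi\}/oi/g;
$file_contents =~ s/\*\{aa\}/aa/g;
$file_contents =~ s/\*\{eou\}/eou/g;
$file_contents =~ s/\*\{ou\}/ou/g;
$file_contents =~ s/\*e/e/g;
$file_contents =~ s/\*i/i/g;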

# Replace text where it shouldn't be replaced with the placeholder
my @matches;
my $match;
while ($file_contents =~ /((\$.*?\$)|(\{\\rm\{(.*?)\}\})|(\\rm\{.*?\})|(\{\\rm.*?\})|(\\\S+))/gs) {
  $match = $&;
  unless ($keep_rm) {
    $match =~ s/\{\\rm\{(.*?)\}\}/$1/s;
    $match =~ s/\\rm\{(.*?)\}/$1/s;
    $match =~ s/\{\\rm(.*?)\}/$1/s;
    $match =~ s/^\s+//;
    chomp $match;
  }
  push @matches, $match;
  $file_contents =~ s/\Q$&\E/$placeholder/;
}

# Checking the number of matches and placeholders
my $count = () = $file_contents =~ /\Q$placeholder\E/g;
if ($count != @matches) {
  die "Error: The number of placeholders ($count) and replacements (" . scalar(@matches) . ") didn't match\n";
}

# Remove the %<ei> and %</ei> marker lines of the sei file
$file_contents =~ s/\R?\h*\%\<ei\>\h*$//gm;
$file_contents =~ s/\R?\h*\%\<\/ei\>\h*$//gm;

# Unicode substitutions

$file_contents =~ s/NNG/ং/g;
$file_contents =~ s/NN/ঁ/g;
$file_contents =~ s/NG/ঙ/g;
$file_contents =~ s/NJ/ঞ/g;
$file_contents =~ s/kK/ক্ষ/g;
$file_contents =~ s/g\/Y/জ্ঞ/g;
$file_contents =~ s/t\/\//ৎ/g;
$file_contents =~ s/n\/kh/ঙ্খ/g;
$file_contents =~ s/n\/gh/ঙ্ঘ/g;
$file_contents =~ s/n\/ch/ঞ্ছ/g;
$file_contents =~ s/n\/jh/ঞ্ঝ/g;
$file_contents =~ s/n\/k/ঙ্ক/g;
$file_contents =~ s/n\/g/ঙ্গ/g;
$file_contents =~ s/n\/c/ঞ্চ/g;
$file_contents =~ s/n\/j/ঞ্জ/g;
$file_contents =~ s/n\/H/হ্ন/g;
$file_contents =~ s/N\/H/হ্ণ/g;

$file_contents =~ s/kh/খ/g;
$file_contents =~ s/gh/ঘ/g;
$file_contents =~ s/ch/ছ/g;
$file_contents =~ s/jh/ঝ/g;
$file_contents =~ s/Th/ঠ/g;
$file_contents =~ s/Dh/ঢ/g;
$file_contents =~ s/th/থ/g;
$file_contents =~ s/dh/ধ/g;
$file_contents =~ s/ph/ফ/g;
$file_contents =~ s/bh/ভ/g;
$file_contents =~ s/sh/শ/g;
$file_contents =~ s/Sh/ষ/g;
$file_contents =~ s/rhh/ঢ়/g;
$file_contents =~ s/rh/ড়/g;
$file_contents =~ s/h/ঃ/g;

$file_contents =~ s/\//্/g;
$file_contents =~ s/oi/ৈ/g;
$file_contents =~ s/OI/ঐ/g;
$file_contents =~ s/eou/ৌ/g;
$file_contents =~ s/ou/ৌ/g;
$file_contents =~ s/OU/ঔ/g;
$file_contents =~ s/uu/ূ/g;
$file_contents =~ s/u/ু/g;
$file_contents =~ s/ea/ো/g;
$file_contents =~ s/e/ে/g;
$file_contents =~ s/aa/ে/g;
$file_contents =~ s/Aa/আ/g;
$file_contents =~ s/a/া/g;
$file_contents =~ s/o//g;

$file_contents =~ s/AA/এ/g;
$file_contents =~ s/A/অ/g;
$file_contents =~ s/b/ব/g;
$file_contents =~ s/c/চ/g;
$file_contents =~ s/d/দ/g;
$file_contents =~ s/D/ড/g;
$file_contents =~ s/E/এ/g;
$file_contents =~ s/g/গ/g;
$file_contents =~ s/H/হ/g;
$file_contents =~ s/ii/ী/g;
$file_contents =~ s/i/ি/g;
$file_contents =~ s/II/ঈ/g;
$file_contents =~ s/I/ই/g;
$file_contents =~ s/j/জ/g;
$file_contents =~ s/J/য/g;
$file_contents =~ s/k/ক/g;
$file_contents =~ s/l/ল/g;
$file_contents =~ s/L/্ল/g;
$file_contents =~ s/m/ম/g;
$file_contents =~ s/M/্ম/g;
$file_contents =~ s/n/ন/g;
$file_contents =~ s/N/ণ/g;
$file_contents =~ s/O/ও/g;
$file_contents =~ s/p/প/g;
$file_contents =~ s/rR/ৃ/g;
$file_contents =~ s/r/র/g;
$file_contents =~ s/RR/ঋ/g;
$file_contents =~ s/R/্র/g;
$file_contents =~ s/s/স/g;
$file_contents =~ s/t/ত/g;
$file_contents =~ s/T/ট/g;
$file_contents =~ s/UU/ঊ/g;
$file_contents =~ s/U/উ/g;
$file_contents =~ s/W/্ব/g;
$file_contents =~ s/y/য়/g;
$file_contents =~ s/Y/্য/g;

$file_contents =~ s/ঁা/াঁ/g;
$file_contents =~ s/ঁ্যা/্যাঁ/g;
$file_contents =~ s/ঁি/িঁ/g;
$file_contents =~ s/ঁী/ীঁ/g;
$file_contents =~ s/ঁু/ুঁ/g;
$file_contents =~ s/ঁূ/ূঁ/g;
$file_contents =~ s/ঁৃ/ৃঁ/g;
$file_contents =~ s/ঁে/েঁ/g;
$file_contents =~ s/ঁৈ/ৈঁ/g;
$file_contents =~ s/ঁো/োঁ/g;
$file_contents =~ s/ঁৌ/ৌঁ/g;

$file_contents =~ s/\./।/g;
$file_contents =~ s/।।।/\.\.\./g;
$file_contents =~ s/---/—/g;
$file_contents =~ s/--/–/g;
$file_contents =~ s/\`\`/❝/g;
$file_contents =~ s/\`/❛/g;
$file_contents =~ s/\'\'/❞/g;
$file_contents =~ s/\'/❜/g;
$file_contents =~ s/\"/❞❯/g;

$file_contents =~ s/0/০/g;
$file_contents =~ s/1/১/g;
$file_contents =~ s/2/২/g;
$file_contents =~ s/3/৩/g;
$file_contents =~ s/4/৪/g;
$file_contents =~ s/5/৫/g;
$file_contents =~ s/6/৬/g;
$file_contents =~ s/7/৭/g;
$file_contents =~ s/8/৮/g;
$file_contents =~ s/9/৯/g;

# Replace the placeholders with the unaffected text
while ($file_contents =~ /\Q$placeholder\E/g) {
  $file_contents =~ s/\Q$&\E/$matches[0]/;
  shift @matches;
}

# Write out
print $fh_out $file_contents;
close($fh_out) or die "Could not close file '$textout': $!";