Santhosh Thottingal santhoshtr

@santhoshtr
santhoshtr / malayalam-tokens-gemma-7b.txt
Created February 22, 2024 03:54
Malayalam tokens in Gemma 7B Model
"ന്": 26465,
"ക്": 28298,
"ത്": 31691,
"ക്ക": 41627,
"ന്ന": 45828,
"▁പ": 46110,
"▁ക": 49867,
"തി": 50292,
"്ട": 52078,
"ും": 55511,
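
A list like this could be produced by filtering a tokenizer vocabulary for entries written in Malayalam script. The sketch below is one hedged way to do that and is not necessarily how the gist was generated. `is_malayalam_token` and `malayalam_tokens` are hypothetical helper names, the vocabulary is assumed to be a plain token-to-id dict, and the sample reuses three entries from the list above plus one non-Malayalam entry with a made-up id.

```python
# Assumption: "▁" (U+2581) is the SentencePiece marker for a leading space.
WORD_BOUNDARY = "\u2581"


def is_malayalam_token(token: str) -> bool:
    """Return True if the token, minus any leading "▁", is entirely Malayalam script."""
    core = token.lstrip(WORD_BOUNDARY)
    # The Malayalam Unicode block spans U+0D00 to U+0D7F.
    return bool(core) and all("\u0d00" <= ch <= "\u0d7f" for ch in core)


def malayalam_tokens(vocab: dict) -> dict:
    """Keep only Malayalam tokens, ordered by token id."""
    selected = {tok: idx for tok, idx in vocab.items() if is_malayalam_token(tok)}
    return dict(sorted(selected.items(), key=lambda item: item[1]))


# "hello": 1 is an illustrative entry, not a real Gemma token id.
sample_vocab = {"▁പ": 46110, "ന്": 26465, "hello": 1, "ക്ക": 41627}
print(malayalam_tokens(sample_vocab))
```

Tokens containing ZWJ or ZWNJ fall outside the Malayalam block and would be skipped by this filter; widen the check if those should be counted.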
santhoshtr / mtcoverage.py
Created November 20, 2023 04:54
Print a tab-separated file with all languages supported by MT providers of WMF Cxserver
# Print a tab-separated file with all languages supported by MT providers
import requests
from typing import List
mtlabels = {
    "Apertium": "Ⓐ",
    "Elia": "Ⓔ",
    "Google": "Ⓖ",
    "MinT": "Ⓜ",
    "Yandex": "Ⓨ",
    "LingoCloud": "Ⓛ",
santhoshtr / sentence terminators.txt
Created October 30, 2023 11:37
Sentence terminators
‭ ! U+00021 BC=ON BLK=Basic_Latin SC=Common EXCLAMATION MARK
‭ ? U+0003F BC=ON BLK=Basic_Latin SC=Common QUESTION MARK
‭ ։ U+00589 BC=L BLK=Armenian SC=Armenian ARMENIAN FULL STOP
‭ ؝ U+0061D BC=AL BLK=Arabic SC=Arabic ARABIC END OF TEXT MARK
‭ ؞ U+0061E BC=AL BLK=Arabic SC=Arabic ARABIC TRIPLE DOT PUNCTUATION MARK
‭ ؟ U+0061F BC=AL BLK=Arabic SC=Common ARABIC QUESTION MARK
‭ ۔ U+006D4 BC=AL BLK=Arabic SC=Arabic ARABIC FULL STOP
‭ ܀ U+00700 BC=AL BLK=Syriac SC=Syriac SYRIAC END OF PARAGRAPH
‭ ܁ U+00701 BC=AL BLK=Syriac SC=Syriac SYRIAC SUPRALINEAR FULL STOP
‭ ܂ U+00702 BC=AL BLK=Syriac SC=Syriac SYRIAC SUBLINEAR FULL STOP
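
One possible use of such a list is a simple sentence splitter. The sketch below is a minimal illustration under stated assumptions: it uses only the ten code points visible in the preview above (the full file may list more, and U+002E FULL STOP is not among the ones shown), and `split_sentences` is a hypothetical helper name, not something from the gist.

```python
import re

# Code points copied from the preview above: ! ? and the Armenian, Arabic
# and Syriac terminators. Extend this string with the rest of the file.
TERMINATORS = "\u0021\u003f\u0589\u061d\u061e\u061f\u06d4\u0700\u0701\u0702"

# Split on whitespace that directly follows one of the terminators.
_BOUNDARY = re.compile(r"(?<=[" + re.escape(TERMINATORS) + r"])\s+")


def split_sentences(text: str) -> list:
    """Split text into sentences, keeping each terminator with its sentence."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


print(split_sentences("Hello! How are you? Fine"))
```

Splitting on a terminator followed by whitespace is a deliberate simplification; it does not handle quotes, abbreviations, or scripts that write sentences without spaces.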
santhoshtr / lid.md
Created June 23, 2023 06:31
Language identification - notes
santhoshtr / manjari-manjula.md
Created November 25, 2022 07:13
Difference between Manjari and Manjula
| Manjari | Manjula |
| --- | --- |
| Maintained by the designer | Maintainer unknown |
| Updates are available | Since there is no maintainer, updates are not expected |
| Source code is available | Only ttf binary is available. Script to convert original Manjari-Regular variant to this ttf is also available |
| OTF, TTF, Webfont version of fonts are provided. OTF is close to the design. TTF is quadratic curve approximation | Only TTF version is provided |
| Regular, Bold, Thin variants are available | Only Regular is provided |
| Public issue tracker is available | No issue tracker |
| Contains large set of glyphs. With Opentype rules, 1971 style or 2022 style can be used | A subset of glyphs to support Government Script Reformation 2022 is available. Note that the font has all glyphs, but the code to form them is removed. So font file size is unnecessarily bigger than required |
santhoshtr / corpus-cleanup-malayalam.sed
Created February 28, 2020 10:38
Malayalam corpus cleanup script
# Misc clean up on corpus
# sed -i -f corpora-cleanup.sed corpus/*.txt
# Chillu normalization
s/ന്‍/ൻ/g
s/ള്‍/ൾ/g
s/ല്‍/ൽ/g
s/ര്‍/ർ/g
s/ന്‍/ൻ/g
s/ണ്‍/ൺ/g
# Remove ZWNJ at end of words
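
The chillu rules above can also be sketched in Python. This is a hedged equivalent, not part of the gist: it assumes the left-hand side of each sed rule is consonant + virama (U+0D4D) + ZWJ (U+200D), the older encoding of chillu letters, and that the right-hand side is the atomic chillu code point. `normalize_chillu` is a hypothetical name, and escapes are used so the invisible joiner is explicit.

```python
VIRAMA = "\u0d4d"  # Malayalam sign virama (chandrakkala)
ZWJ = "\u200d"     # zero width joiner

# Base consonant -> atomic chillu letter.
CHILLU_MAP = {
    "\u0d28": "\u0d7b",  # ന -> ൻ
    "\u0d33": "\u0d7e",  # ള -> ൾ
    "\u0d32": "\u0d7d",  # ല -> ൽ
    "\u0d30": "\u0d7c",  # ര -> ർ
    "\u0d23": "\u0d7a",  # ണ -> ൺ
}


def normalize_chillu(text: str) -> str:
    """Replace consonant + virama + ZWJ sequences with atomic chillu letters."""
    for base, chillu in CHILLU_MAP.items():
        text = text.replace(base + VIRAMA + ZWJ, chillu)
    return text


old_style = "\u0d05\u0d35\u0d28\u0d4d\u200d"  # അവൻ written with the joiner sequence
print(normalize_chillu(old_style) == "\u0d05\u0d35\u0d7b")
```

Sequences without the joiner (plain consonant + virama) are left untouched, which matches the assumed behaviour of the sed rules.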
santhoshtr / process.js
Created February 21, 2020 05:59
process ligatures and glyphs for Manjari
const glyphs = require('./glyphs.json').glyphs
const ligatures = require('./ligatures.json').ligatures
const getGlyphValue = (glyphname) => {
  const glyph = glyphs.find(g => g.glyph === glyphname);
  return glyph && glyph.value;
}
const process = () => {
  const ligaturesLength = ligatures.length;
santhoshtr / KeralaPRDHeadlinesCrawler.py
Created October 26, 2019 07:27
Crawl Kerala PRD website and download all content to json
import scrapy
from scrapy.http import Request
class HeadlineCatcher(scrapy.Spider):
    name = "headlinecatcher"
    start_urls = ["http://www.prd.kerala.gov.in/pressrelease"]
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
    }

Keybase proof

I hereby claim:

  • I am santhoshtr on github.
  • I am sthottingal (https://keybase.io/sthottingal) on keybase.
  • I have a public key ASAP_nrhFC103eL1sF9vFA9M4mrxkfvudZ2I-Bd9kiukOgo

To claim this, I am signing this object:

santhoshtr / hd-playlist-audio-downloader.sh
Created July 9, 2018 11:23
HD audio download from a YouTube playlist
youtube-dl -f bestaudio --extract-audio --audio-format mp3 --audio-quality 0 -o "%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=abdlshfjskdhfuwhrklk"