@shantanuo
shantanuo / scan.py
Created November 27, 2023 08:55
Gita repeat phrases
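# Copy and paste the plain text from the site below into the file gg.txt:
# https://sanskritdocuments.org/doc_giitaa/bhagvadnew.html
# Scanning the full Bhagavad Gita takes around 4 hours.
import re, sys
# Read the whole file into memory (UTF-8 encoding is assumed).
with open('gg.txt', encoding='utf-8') as f:
    mytext = f.read()
# Replace newlines with spaces so the text becomes one continuous line.
mytext2 = re.sub('\n', ' ', mytext)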
@shantanuo
shantanuo / sanskrit_sandhi
Created May 14, 2023 09:55
Sandhi reference table based on Panini's rules
a,b,c,d
ृ,अ,र,244509
ृ,अं,रं,6280
ृ,आ,रा,37141
ृ,इ,रि,21326
ृ,ई,री,9213
ृ,उ,रु,11076
ृ,ऊ,रू,5206
ृ,ए,रे,12990
ृ,ऐ,रै,1739
@shantanuo
shantanuo / new_lang.aff
Created February 8, 2023 03:35
affix file for a new language
SET UTF-8
NOSPLITSUGS
PFX A Y 1
PFX A 0 a .
PFX B Y 1
PFX B 0 b .
PFX C Y 1
@shantanuo
shantanuo / tokenizers.md
Created January 30, 2023 06:29 — forked from akhan619/tokenizers.md
Exploring Tokenizers from Hugging Face

Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made natural language processing (NLP) a breeze. In this post, we take a hands-on look at tokenization with the help of the Tokenizers library. We will load a real-world dataset containing 10-K filings of public firms and see how to train a tokenizer from scratch based on the BERT tokenization scheme. Along the way, we will look at tokenization in detail and at some gotchas to keep an eye out for.
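
To make the BERT tokenization scheme a little more concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers use to split a word into sub-tokens. It is a hand-written illustration in plain Python, not the Tokenizers library API, and the function name `wordpiece_tokenize` and the `toy_vocab` set are hypothetical names made up for this example. A trained tokenizer would learn its vocabulary from the corpus instead of using a hard-coded set.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece-style sub-tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry fits here, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##izer", "##s", "fil", "##ing"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("filings", toy_vocab))       # ['fil', '##ing', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

The sketch covers only the lookup step. Normalization, pre-tokenization into words, and vocabulary training are left out.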

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

@shantanuo
shantanuo / hide_wiki_users.js
Last active July 1, 2023 04:49
Hide some users from the "Recent changes" page of Marathi Wikipedia
// Download and install the Tampermonkey add-on for Firefox or Chrome: https://www.tampermonkey.net/
// Add this user script to hide the trusted users so that you can check new users
// ==UserScript==
// @name Hide usernames from the "Recent changes" page
// @version 0.1
// @author double-beep
// @description test this
// @match https://mr.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*
// @match https://mr.wikisource.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*
@shantanuo
shantanuo / change_format.txt
Created February 7, 2022 08:59
LibreOffice Writer macro for converting vertically stacked conjunct characters to their horizontal form
REM ***** BASIC *****
Sub GaMa_jodakshar
Dim oDoc As Variant
Dim oSearchDescriptor As Variant
Dim oDictionary As Variant
Dim oReplace As Variant
Dim oStatusIndicator As Variant
Dim i As Long, j As Long
oDoc = ThisComponent
1) Copy the Hindi itrans file to a new Marathi gamabhana file
a) sudo su
b) cd /usr/share/m17n/
c) cp hi-itrans.mim mr-gamabhana.mim
2) Open the file mr-gamabhana.mim and make the following 3 changes:
a) Change the following line to switch the language from Hindi to Marathi
Old:
(input-method hi itrans)
New:
@shantanuo
shantanuo / all_summary.py
Created December 30, 2021 09:50 — forked from NewscatcherAPI/all_summary.py
spacy_vs_nltk_newscatcher_blog
summary = [article['summary'] for article in articles]
sentence = summary[0]
@shantanuo
shantanuo / error_while_resotring_xml.txt
Created August 2, 2021 12:55
The content model 'Scribunto' is not registered on this wiki.
root@aace30d9b5f3:/var/www/html/mediawiki-1.36.1# time php ./maintenance/importDump.php mrwiki-latest-pages-articles-multistream.xml
91800 (4.61 pages/sec 4.61 revs/sec)
91900 (4.61 pages/sec 4.61 revs/sec)
92000 (4.61 pages/sec 4.61 revs/sec)
92100 (4.62 pages/sec 4.62 revs/sec)
92200 (4.61 pages/sec 4.61 revs/sec)
MWUnknownContentModelException from line 201 of /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php: The content model 'Scribunto' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php(270): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('Scribunto', NULL)
@shantanuo
shantanuo / index.html
Last active May 17, 2021 00:50 — forked from aquilax/index.html
Sort textarea unique
<a href="javascript:(function(){Array.from(document.querySelectorAll('textarea')).map(function(b){var a=document.createElement('div');var d=document.createElement('button');d.textContent='↑';d.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().join('\n')});var c=document.createElement('button');c.textContent='↓';c.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().reverse().join('\n')});a.appendChild(d);a.appendChild(c);b.parentNode.insertBefore(a,b)})})();">Sort textarea unique</a>