@shantanuo
shantanuo / hide_wiki_users.js
Last active July 1, 2023 04:49
Hide some users from the "Recent Changes" page of Marathi Wikipedia
// Download and install the Tampermonkey add-on for Firefox or Chrome: https://www.tampermonkey.net/
// Add this userscript to hide trusted users so that you can review edits by new users
// ==UserScript==
// @name Hide usernames from the "Recent changes" page
// @version 0.1
// @author double-beep
// @description test this
// @match https://mr.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*
// @match https://mr.wikisource.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*
shantanuo / sanskrit_sandhi
Created May 14, 2023 09:55
Sandhi reference table based on Panini's rules
a,b,c,d
ृ,अ,र,244509
ृ,अं,रं,6280
ृ,आ,रा,37141
ृ,इ,रि,21326
ृ,ई,री,9213
ृ,उ,रु,11076
ृ,ऊ,रू,5206
ृ,ए,रे,12990
ृ,ऐ,रै,1739
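A minimal sketch of how a table like this could be loaded and queried. The column meanings are an assumption, since the header only names them a, b, c, d: here `a` is taken to be the final sign of the first word, `b` the initial vowel of the second word, `c` the combined form, and `d` a frequency count. The helper name `load_rules` is hypothetical.

```python
import csv
import io

# A few rows copied from the table above. Column meanings are ASSUMED:
# a = final sign of the first word, b = initial vowel of the second word,
# c = combined form, d = frequency count.
SAMPLE = """a,b,c,d
ृ,अ,र,244509
ृ,आ,रा,37141
ृ,इ,रि,21326
"""

def load_rules(text):
    """Return a dict mapping (a, b) to a (c, count) tuple."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["a"], row["b"]): (row["c"], int(row["d"])) for row in reader}

rules = load_rules(SAMPLE)
combined, count = rules[("ृ", "आ")]
print(combined, count)
```

Keying the dictionary on the (a, b) pair keeps each lookup constant-time, and the count stays available for ranking rules if that is what column d represents.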
shantanuo / new_lang.aff
Created February 8, 2023 03:35
affix file for a new language
SET UTF-8
NOSPLITSUGS
PFX A Y 1
PFX A 0 a .
PFX B Y 1
PFX B 0 b .
PFX C Y 1
shantanuo / tokenizers.md
Created January 30, 2023 06:29 — forked from akhan619/tokenizers.md
Exploring Tokenizers from Hugging Face

Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) far more accessible. In this post, we take a hands-on look at tokenization with the help of the Tokenizers library. We load a real-world dataset containing 10-K filings of public firms and train a tokenizer from scratch based on the BERT tokenization scheme. Along the way, we examine tokenization in detail and point out some gotchas to watch for.
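To make the BERT scheme concrete before turning to the library, here is a minimal pure-Python sketch of greedy longest-match-first sub-word splitting, the WordPiece-style lookup that BERT tokenizers apply to each word. It is an illustration only, not the Tokenizers library's implementation; the function name `wordpiece_tokenize` and the toy vocabulary are made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for illustration.
toy_vocab = {"fil", "##ing", "##s"}
print(wordpiece_tokenize("filings", toy_vocab))  # ['fil', '##ing', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Training a tokenizer amounts to learning the vocabulary that this lookup consults, which is why the choice of training corpus matters so much for domain text such as 10-K filings.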

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

shantanuo / mysql_to_big_query.sh
Last active September 14, 2022 07:12
Copy a MySQL table to BigQuery. To copy all tables, use the loop given at the end. The script exits with error code 3 if blob or text columns are found. The CSV files are first copied to Google Cloud before being imported into BigQuery.
#!/bin/sh
TABLE_SCHEMA=$1
TABLE_NAME=$2
mytime=$(date '+%y%m%d%H%M')
hostname=$(hostname | tr 'A-Z' 'a-z')
file_prefix="trimax$TABLE_NAME$mytime$TABLE_SCHEMA"
bucket_name=$file_prefix
splitat="4000000000"
bulkfiles=200
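The description says the script exits with error code 3 when blob or text columns are found. The part of the script that does this is not shown here, so the following is only a sketch of that check in Python, assuming the column names and types have already been read (for example from `information_schema.COLUMNS`). The function name `export_exit_code` and the exact set of rejected types are assumptions.

```python
# MySQL column types treated as unsuitable for the CSV export (ASSUMED list).
UNSUPPORTED = {
    "tinyblob", "blob", "mediumblob", "longblob",
    "tinytext", "text", "mediumtext", "longtext",
}

def export_exit_code(columns):
    """Return 3 if any column has a blob or text type, otherwise 0.

    `columns` is a list of (column_name, data_type) pairs, such as the
    COLUMN_NAME and DATA_TYPE values from information_schema.COLUMNS.
    """
    for _name, data_type in columns:
        if data_type.lower() in UNSUPPORTED:
            return 3
    return 0

print(export_exit_code([("id", "int"), ("body", "TEXT")]))
print(export_exit_code([("id", "int"), ("name", "varchar")]))
```

Running the check before any export means a table with unsupported columns fails fast, before files are written or uploaded.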
shantanuo / mysql_debug.sh
Last active June 1, 2022 03:19
Gather MySQL statistics and send them for analysis
#!/bin/sh
user='root'
password='company'
adminmail='s.o@gmail.com'
> to_study.txt
> to_study_err.txt
mysqladmin -u"$user" -p"$password" debug
errorlog=$(mysqladmin variables | grep log_error | awk '{print $4}')
shantanuo / change_format.txt
Created February 7, 2022 08:59
LibreOffice Writer macro to convert vertically stacked conjunct characters into horizontal ones
REM ***** BASIC *****
Sub GaMa_jodakshar
Dim oDoc As Variant
Dim oSearchDescriptor As Variant
Dim oDictionary As Variant
Dim oReplace As Variant
Dim oStatusIndicator As Variant
Dim i As Long, j As Long
oDoc = ThisComponent
1) Copy the Hindi ITRANS file to create the Marathi Gamabhana file
a) sudo su
b) cd /usr/share/m17n/
c) cp hi-itrans.mim mr-gamabhana.mim
2) Open the file mr-gamabhana.mim and make the following 3 changes:
a) Edit the following line to switch the language from Hindi to Marathi
Old:
(input-method hi itrans)
New:
shantanuo / all_summary.py
Created December 30, 2021 09:50 — forked from NewscatcherAPI/all_summary.py
spacy_vs_nltk_newscatcher_blog
summary = [article['summary'] for article in articles]
sentence = summary[0]
shantanuo / error_while_resotring_xml.txt
Created August 2, 2021 12:55
The content model 'Scribunto' is not registered on this wiki.
root@aace30d9b5f3:/var/www/html/mediawiki-1.36.1# time php ./maintenance/importDump.php mrwiki-latest-pages-articles-multistream.xml
91800 (4.61 pages/sec 4.61 revs/sec)
91900 (4.61 pages/sec 4.61 revs/sec)
92000 (4.61 pages/sec 4.61 revs/sec)
92100 (4.62 pages/sec 4.62 revs/sec)
92200 (4.61 pages/sec 4.61 revs/sec)
MWUnknownContentModelException from line 201 of /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php: The content model 'Scribunto' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php(270): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('Scribunto', NULL)