Shantanu Oak shantanuo

## scan.py
# Copy and paste the plain text from the site into the file gg.txt
# https://sanskritdocuments.org/doc_giitaa/bhagvadnew.html
# The Bhagavad Gita takes around 4 hours to scan

import re, sys

# The text is Devanagari, so read it as UTF-8 regardless of the platform default
with open('gg.txt', encoding='utf-8') as f:
    mytext = f.read()

mytext2 = re.sub('\n', ' ', mytext)

## sanskrit_sandhi
a,b,c,d
ृ,अ,र,244509
ृ,अं,रं,6280
ृ,आ,रा,37141
ृ,इ,रि,21326
ृ,ई,री,9213
ृ,उ,रु,11076
ृ,ऊ,रू,5206
ृ,ए,रे,12990
ृ,ऐ,रै,1739
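
A minimal sketch of how the rows above could be loaded for lookup. The column meanings are an assumption, since the file does not state them: `a` is read as the sign ending the first word, `b` as the vowel starting the next word, `c` as the combined form, and `d` as a count. The names `SAMPLE` and `load_rules` are hypothetical and do not come from the original gist.

```python
import csv
import io

# The nine rows shown above, copied verbatim. The full file is assumed
# to continue in the same four-column shape.
SAMPLE = """a,b,c,d
ृ,अ,र,244509
ृ,अं,रं,6280
ृ,आ,रा,37141
ृ,इ,रि,21326
ृ,ई,री,9213
ृ,उ,रु,11076
ृ,ऊ,रू,5206
ृ,ए,रे,12990
ृ,ऐ,रै,1739
"""


def load_rules(text):
    """Map each (a, b) pair to its (c, d) pair, with d parsed as an integer."""
    rules = {}
    for row in csv.DictReader(io.StringIO(text)):
        rules[(row["a"], row["b"])] = (row["c"], int(row["d"]))
    return rules


rules = load_rules(SAMPLE)

# Pick the (a, b) pair whose d value is largest.
most_frequent = max(rules, key=lambda pair: rules[pair][1])
print(len(rules), most_frequent, rules[most_frequent])
```

Keying the dictionary on the `(a, b)` pair keeps a lookup to a single step; whether `d` is a corpus frequency is not confirmed by the source.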

## new_lang.aff
SET UTF-8
NOSPLITSUGS

PFX A Y	1
PFX A 0 a .

PFX B Y	1
PFX B 0 b .

PFX C Y	1

## tokenizers.md

      
Created January 30, 2023 06:29, forked from akhan619/tokenizers.md

Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) a breeze. In this post, we take a hands-on look at tokenization with the help of the Tokenizers library. We will load a real-world dataset containing 10-K filings of public firms and see how to train a tokenizer from scratch based on the BERT tokenization scheme. Along the way, we will examine tokenization in detail and point out some gotchas to watch for.
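
To make the BERT scheme concrete before any training happens, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece uses to split a single word once a vocabulary exists. It uses only the Python standard library rather than the Tokenizers library, and the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, chosen for illustration only.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary; a trained tokenizer would learn this from the corpus.
toy_vocab = {"token", "##ization", "##s", "fil", "##ing", "[UNK]"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("filings", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Training decides which pieces enter the vocabulary; the lookup above only shows how a word is split afterwards.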
Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.
For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

  
## hide_wiki_users.js
# Download and install the Tampermonkey add-on for Firefox or Chrome: https://www.tampermonkey.net/
# Add this user script to hide trusted users so that you can check new users

// ==UserScript==
// @name         Hide usernames from the "Recent changes" page
// @version      0.1
// @author       double-beep
// @description   test this
// @match        https://mr.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*
// @match        https://mr.wikisource.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%85%E0%A4%B2%E0%A5%80%E0%A4%95%E0%A4%A1%E0%A5%80%E0%A4%B2_%E0%A4%AC%E0%A4%A6%E0%A4%B2*

## change_format.txt
REM  *****  BASIC  *****

Sub GaMa_jodakshar
Dim oDoc As Variant
Dim oSearchDescriptor As Variant
Dim oDictionary As Variant
Dim oReplace As Variant
Dim oStatusIndicator As Variant
Dim i As Long, j As Long
	oDoc = ThisComponent

## itrans_gamabhana
1) Copy the Hindi itrans file to create the Marathi gamabhana file
a) sudo su
b) cd /usr/share/m17n/
c) cp hi-itrans.mim mr-gamabhana.mim

2) Open the file mr-gamabhana.mim and make the following 3 changes:
a) Change the following line to switch the language from Hindi to Marathi
Old:
(input-method hi itrans)
New:

## all_summary.py
summary = [article['summary'] for article in articles]
sentence = summary[0]

## error_while_resotring_xml.txt
root@aace30d9b5f3:/var/www/html/mediawiki-1.36.1# time php ./maintenance/importDump.php mrwiki-latest-pages-articles-multistream.xml

91800 (4.61 pages/sec 4.61 revs/sec)
91900 (4.61 pages/sec 4.61 revs/sec)
92000 (4.61 pages/sec 4.61 revs/sec)
92100 (4.62 pages/sec 4.62 revs/sec)
92200 (4.61 pages/sec 4.61 revs/sec)
MWUnknownContentModelException from line 201 of /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php: The content model 'Scribunto' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /var/www/html/mediawiki-1.36.1/includes/content/ContentHandlerFactory.php(270): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('Scribunto', NULL)

## index.html
<a href="javascript:(function(){Array.from(document.querySelectorAll('textarea')).map(function(b){var a=document.createElement('div');var d=document.createElement('button');d.textContent='↑';d.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().join('\n')});var c=document.createElement('button');c.textContent='↓';c.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().reverse().join('\n')});a.appendChild(d);a.appendChild(c);b.parentNode.insertBefore(a,b)})})();">Sort textarea unique</a>