Shantanu Oak shantanuo

## python_version_history.ipynb

      
Created March 27, 2024 06:18, forked from justmarkham/python_version_history.ipynb

Data School blog post
    
## tokenizers.md

      
Created January 30, 2023 06:29, forked from akhan619/tokenizers.md

### Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) a breeze. In this post, we take a hands-on look at tokenization with the help of the Tokenizers library. We will load a real-world dataset containing 10-K filings of public firms and train a tokenizer from scratch based on the BERT tokenization scheme. Along the way we will look at tokenization in detail and at some gotchas to keep an eye out for.
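
To make the BERT scheme a little more concrete: BERT-style tokenizers split each word into subword pieces by greedy longest-match-first lookup against a vocabulary (WordPiece). The sketch below illustrates only that lookup step in plain Python. It is not the Tokenizers library API, and the function name `wordpiece_tokenize` and the toy vocabulary are made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than anything trained on real data.
toy_vocab = {"token", "##izer", "##s", "[UNK]"}
print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("filing", toy_vocab))
```

A word that cannot be fully covered by vocabulary pieces collapses to the unknown token, which is one reason the training corpus matters when building a vocabulary from scratch.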

#### Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.
For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

  
## all_summary.py
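# Collect the summary text of every article (`articles`: list of dicts, defined elsewhere).
summary = [article['summary'] for article in articles]
# Use the first summary as the sample sentence.
sentence = summary[0]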

## index.html
<a href="javascript:(function(){Array.from(document.querySelectorAll('textarea')).map(function(b){var a=document.createElement('div');var d=document.createElement('button');d.textContent='↑';d.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().join('\n')});var c=document.createElement('button');c.textContent='↓';c.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().reverse().join('\n')});a.appendChild(d);a.appendChild(c);b.parentNode.insertBefore(a,b)})})();">Sort textarea unique</a>
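
The core of the bookmarklet is a small text transformation: split the textarea value on newlines, drop duplicate lines, sort, and join again. A rough Python equivalent is sketched below. The name `sort_unique_lines` is hypothetical, and JavaScript's default sort may order some non-ASCII strings differently from Python's `sorted`.

```python
def sort_unique_lines(text, reverse=False):
    """Return the unique lines of `text`, sorted ascending or descending."""
    unique_lines = set(text.split("\n"))
    return "\n".join(sorted(unique_lines, reverse=reverse))


print(sort_unique_lines("pear\napple\npear"))
print(sort_unique_lines("pear\napple\npear", reverse=True))
```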

## cache_example.py
import streamlit as st
import pandas as pd

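# Reuse this data across runs!
# Note: `st.cache` is deprecated in newer Streamlit releases; `st.cache_data` replaces it for data like this.
read_and_cache_csv = st.cache(pd.read_csv)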

BUCKET = "https://streamlit-self-driving.s3-us-west-2.amazonaws.com/"
data = read_and_cache_csv(BUCKET + "labels.csv.gz", nrows=1000)
desired_label = st.selectbox('Filter to:', ['car', 'truck'])
st.write(data[data.label == desired_label])

## model.py
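from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# X (assumed: a list of feature dicts) and y (labels) are defined elsewhere.
clf = Pipeline([("dct", DictVectorizer()), ("svc", LinearSVC())])
# Search the SVM regularization strength C over a wide log-scale grid.
params = {
    "svc__C": [1e15, 1e13, 1e11, 1e9, 1e7, 1e5, 1e3, 1e1, 1e-1, 1e-3, 1e-5]
}
gs = GridSearchCV(clf, params, cv=10, verbose=2, n_jobs=-1)
gs.fit(X, y)
model = gs.best_estimator_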

## mongo-ls.js
// Usage: mongo {Server without mongodb:// example 127.0.0.1:27017}/{DbName} [-u {Username}] [-p {Password}] < ./mongo-ls.js

var collections = db.getCollectionNames();

print('Collections inside the db:');
for(var i = 0; i < collections.length; i++){
  var name = collections[i];

  // Skip MongoDB's internal system.* collections.
  if(name.substr(0, 6) != 'system')
    print(name + ' - ' + db[name].count() + ' records');
}

## tree.markdown

      
Created December 8, 2016 12:54, forked from matthandlersux/tree.markdown

one-line tree data structure in python

in python:

from pprint import pprint
from collections import defaultdict


# one-line definition of a tree:
def tree(): return defaultdict(tree)

  
## tree.md

      
Created December 8, 2016 12:48, forked from hrldcpr/tree.md

### One-line Tree in Python

Using Python's built-in defaultdict we can easily define a tree data structure:
def tree(): return defaultdict(tree)
That's it!
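
A quick usage sketch shows why this works: every missing key is filled with another tree, so nested paths can be created just by accessing them. The helper `to_dict` and the sample keys are illustrative additions, not part of the one-line definition.

```python
from collections import defaultdict


def tree():
    return defaultdict(tree)


def to_dict(t):
    """Recursively convert a tree into plain dicts for printing or comparison."""
    return {
        key: to_dict(value) if isinstance(value, defaultdict) else value
        for key, value in t.items()
    }


taxonomy = tree()
taxonomy['Animalia']['Chordata']['Mammalia']  # merely accessing the path creates it
taxonomy['Plantae']['count'] = 1              # leaves can hold ordinary values

print(to_dict(taxonomy))
```

One consequence of this design is that a read of a missing key also creates it, so membership checks should use `in` rather than indexing.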

  
## natural_language_time_periods.py
import datetime
from dateutil.relativedelta import relativedelta

def today(date=None,iso=False):
    if date is None: date=datetime.date.today()
    if iso: return date.isoformat()
    else: return date

def yesterday(date=None,iso=False):
    if date is None: date=today()
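
The preview cuts off inside `yesterday`. One way the helper could continue, mirroring `today` above, is sketched here. The body of `yesterday` is an assumption rather than the author's code, and it swaps in the standard library's `datetime.timedelta` for `relativedelta` so the example needs no third-party package.

```python
import datetime


def today(date=None, iso=False):
    """Return `date` (default: the current date), optionally as an ISO string."""
    if date is None:
        date = datetime.date.today()
    return date.isoformat() if iso else date


def yesterday(date=None, iso=False):
    """Assumed completion: the day before `date` (default: today)."""
    if date is None:
        date = today()
    result = date - datetime.timedelta(days=1)
    return result.isoformat() if iso else result


print(yesterday(datetime.date(2024, 3, 1)))
print(yesterday(datetime.date(2024, 1, 1), iso=True))
```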