@shantanuo
shantanuo / tokenizers.md
Created January 30, 2023 06:29 — forked from akhan619/tokenizers.md
Exploring Tokenizers from Hugging Face

Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) a breeze. In this post, we take a hands-on look at tokenization using the Tokenizers library. We will load a real-world dataset of 10-K filings from public firms and train a tokenizer from scratch based on the BERT tokenization scheme. Along the way, we will examine tokenization in detail and point out some gotchas to watch for.
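
As a rough illustration of what a BERT-style (WordPiece) tokenizer does once it has a vocabulary, the sketch below splits a single word using greedy longest-match-first lookup. The function name `wordpiece_tokenize`, the toy vocabulary, and the `[UNK]` fallback are assumptions made for illustration. This is not the Tokenizers library's implementation, which also handles normalization, pre-tokenization, and training.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces, longest match first.

    Pieces that do not start the word carry a "##" prefix,
    following the BERT convention.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"token", "##izer", "##s", "fil", "##ing"}

print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("filing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word made entirely of known pieces is split into those pieces, while a word with any unmatched segment collapses to the unknown token. That is one reason the vocabulary learned during training matters so much.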

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

@shantanuo
shantanuo / all_summary.py
Created December 30, 2021 09:50 — forked from NewscatcherAPI/all_summary.py
spacy_vs_nltk_newscatcher_blog
summary = [article['summary'] for article in articles]
sentence = summary[0]
@shantanuo
shantanuo / index.html
Last active May 17, 2021 00:50 — forked from aquilax/index.html
Sort textarea unique
<a href="javascript:(function(){Array.from(document.querySelectorAll('textarea')).map(function(b){var a=document.createElement('div');var d=document.createElement('button');d.textContent='↑';d.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().join('\n')});var c=document.createElement('button');c.textContent='↓';c.addEventListener('click',function(f){f.preventDefault();b.value=Array.from(new Set(b.value.split('\n'))).sort().reverse().join('\n')});a.appendChild(d);a.appendChild(c);b.parentNode.insertBefore(a,b)})})();">Sort textarea unique</a>
@shantanuo
shantanuo / cache_example.py
Created October 28, 2019 09:55 — forked from treuille/cache_example.py
This demonstrates the st.cache function
import streamlit as st
import pandas as pd
# Reuse this data across runs!
read_and_cache_csv = st.cache(pd.read_csv)
BUCKET = "https://streamlit-self-driving.s3-us-west-2.amazonaws.com/"
data = read_and_cache_csv(BUCKET + "labels.csv.gz", nrows=1000)
desired_label = st.selectbox('Filter to:', ['car', 'truck'])
st.write(data[data.label == desired_label])
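from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
clf = Pipeline([("dct", DictVectorizer()), ("svc", LinearSVC())])
params = {
    "svc__C": [1e15, 1e13, 1e11, 1e9, 1e7, 1e5, 1e3, 1e1, 1e-1, 1e-3, 1e-5]
}
# X (a list of feature dicts) and y (the labels) are assumed to be defined elsewhere.
gs = GridSearchCV(clf, params, cv=10, verbose=2, n_jobs=-1)
gs.fit(X, y)
model = gs.best_estimator_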
@shantanuo
shantanuo / mongo-ls.js
Created April 24, 2019 05:23 — forked from matteofigus/mongo-ls.js
A script to list all the collections and document count for a specific mongodb db
// Usage: mongo {Server without mongodb:// example 127.0.0.1:27017}/{DbName} [-u {Username}] [-p {Password}] < ./mongo-ls.js
var collections = db.getCollectionNames();
print('Collections inside the db:');
for(var i = 0; i < collections.length; i++){
  var name = collections[i];
  // Skip MongoDB's internal system.* collections.
  if(name.substr(0, 6) != 'system')
    print(name + ' - ' + db[name].count() + ' records');
}
@shantanuo
shantanuo / tree.markdown
Created December 8, 2016 12:54 — forked from matthandlersux/tree.markdown
one-line tree datastructure in python

in python:

from pprint import pprint
from collections import defaultdict


# one-line definition of a tree:
def tree(): return defaultdict(tree)
@shantanuo
shantanuo / tree.md
Created December 8, 2016 12:48 — forked from hrldcpr/tree.md
one-line tree in python

One-line Tree in Python

Using Python's built-in defaultdict we can easily define a tree data structure:

def tree(): return defaultdict(tree)

That's it!
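
A minimal usage sketch of the one-line tree follows. The `menu` data and the `to_plain_dict` helper are hypothetical and only serve to show how intermediate levels are created on first access.

```python
from collections import defaultdict


def tree():
    # Each missing key creates a new subtree on first access.
    return defaultdict(tree)


def to_plain_dict(node):
    # Hypothetical helper: recursively convert nested defaultdicts
    # into plain dicts so the structure is easy to print or compare.
    if isinstance(node, defaultdict):
        return {key: to_plain_dict(child) for key, child in node.items()}
    return node


menu = tree()
# No need to create "drinks" or "hot" first; they appear automatically.
menu["drinks"]["hot"]["tea"] = 2
menu["drinks"]["hot"]["coffee"] = 3
menu["food"]["soup"] = 5

print(to_plain_dict(menu))
```

One caveat: merely reading a missing key (for example `menu["desserts"]`) also creates an empty subtree, so membership checks are safer with `in` than with indexing.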

@shantanuo
shantanuo / natural_language_time_periods.py
Created September 28, 2016 13:23 — forked from psychemedia/natural_language_time_periods.py
Simple python functions to give dates and date ranges in "natural time"; this week, next month, etc.
import datetime
from dateutil.relativedelta import relativedelta
def today(date=None,iso=False):
    if date is None: date=datetime.date.today()
    if iso: return date.isoformat()
    else: return date
def yesterday(date=None,iso=False):
    if date is None: date=today()