akhan619 /
Last active October 31, 2023 10:22
Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made Natural Language Processing (NLP) a breeze. In this post, we take a hands-on look at tokenization with the help of the Tokenizers library. We will load a real-world dataset containing 10-K filings of public firms and train a tokenizer from scratch based on the BERT tokenization scheme. Along the way, we will cover tokenization in detail and point out some gotchas to keep an eye out for.
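
To make the BERT tokenization scheme a little more concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers apply to each word. It is plain Python rather than the Tokenizers library, and both the function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, made up for illustration; a real vocabulary would be learned from the corpus during training.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" marker.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from any trained model).
toy_vocab = {"token", "##ization", "##izer", "fil", "##ing", "##s", "[UNK]"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("filings", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word is split into the longest pieces that can be found from left to right, and a word with no matching piece falls back to the unknown token. Training a tokenizer amounts to choosing which pieces go into the vocabulary.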

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

manuelmazzuola / restclient.go
Last active July 27, 2023 12:48
how to implement a RESTClientGetter for helm action pkg
package main
import (
willprice /
Last active February 8, 2023 21:27
Install OpenCV 4.1.2 for Raspberry Pi 3 or 4 (Raspbian Buster)

Install OpenCV 4.1.2 on Raspbian Buster

$ chmod +x *.sh
$ ./
$ ./
$ ./
$ cd ~/opencv/opencv-4.1.2/build
$ sudo make install
mozillazg /
Created December 5, 2017 00:06
A simple demo for how to use flask-paginate.
from flask import Flask, render_template
from flask_paginate import Pagination, get_page_args
app = Flask(__name__)
app.template_folder = ''
users = list(range(100))
def get_users(offset=0, per_page=10):
simonw /
Last active May 30, 2024 00:39
How to recover lost Python source code if it's still resident in-memory

I screwed up using git ("git checkout --" on the wrong file) and managed to delete the code I had just written... but it was still running in a process in a docker container. Here's how I got it back, using and

Attach a shell to the docker container

Install GDB (needed by pyrasite)

apt-get update && apt-get install gdb
joepie91 / index.js
Last active June 23, 2023 23:42
Breaking CloudFlare's "I'm Under Attack" challenge
'use strict';
const parseExpression = require("./parse-expression");
function findAll(regex, target) {
let results = [], match;
while (match = regex.exec(target)) {
markwallsgrove / go-update.go
Created February 18, 2016 20:21
go-update tutorial, fast start
package main
// This gist documents go-install using a SHA-256 checksum,
// elliptic curve (prime 256) encrypted signature & bsdiff
// formatted patch.
// These steps took me over an hour to figure out. go-install currently
// doesn't include a tutorial or quick start guide, so I created this
shmup /
Last active June 8, 2024 16:32
transmission blocklist guide

Transmission Blocklist

The Transmission torrent client has an option to set a Blocklist, which helps protect you from being identified and receiving a DMCA notice by letter or email.

It's as simple as downloading and installing the latest client:

abhishektomar /
Last active October 12, 2022 10:02
Bash Script to Install Elastic Search, Logstash and Kibana
# Checking whether user has enough permission to run this script
sudo -n true
if [ $? -ne 0 ]; then
    echo "This script requires user to have passwordless sudo access"
fi
lmars / commands.txt
Last active January 22, 2016 20:43
Flynn Redis
# create a redis app
flynn create --remote "" redis
# create a release using the latest (at the time of writing) Docker Redis image
flynn -a redis release add -f config.json ""
# scale the server to one process. This may time out initially as the server pulls the image, but watch "flynn -a redis ps" and it should come up.
flynn -a redis scale server=1
# redis should now be running in the cluster at redis.discoverd:6379