Skip to content

Instantly share code, notes, and snippets.

@tamuhey
tamuhey / tokenizations_post.md
Last active June 26, 2024 01:00
How to calculate the alignment between BERT and spaCy tokens effectively and robustly

How to calculate the alignment between BERT and spaCy tokens effectively and robustly

image

site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years because of neural networks, which allows us to solve various tasks with end-to-end architecture. However, many NLP systems still require language-specific pre- and post-processing, especially in tokenizations. In this article, I describe an algorithm that simplifies calculating correspondence between tokens (e.g. BERT vs. spaCy), one such process. And I introduce Python and Rust libraries that implement this algorithm. Here are the library and the demo site links:

@disintegrator
disintegrator / GatsbyContentfulWithReadingTime.graphql
Last active June 19, 2022 13:43
Add a reading time estimate to all your Contentful rich text nodes in GatsbyJS
{
contentfulBlogPost {
body {
fields {
readingTime {
text
minutes
time
words
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@erikbern
erikbern / install-tensorflow.sh
Last active June 26, 2023 00:40
Installing TensorFlow on EC2
# Note – this is not a bash script (some of the steps require reboot)
# I named it .sh just so Github does correct syntax highlighting.
#
# This is also available as an AMI in us-east-1 (virginia): ami-cf5028a5
#
# The CUDA part is mostly based on this excellent blog post:
# http://tleyden.github.io/blog/2014/10/25/cuda-6-dot-5-on-aws-gpu-instance-running-ubuntu-14-dot-04/
# Install various packages
sudo apt-get update
@dgp
dgp / youtube api video category id list
Created June 11, 2015 05:57
youtube api video category id list
2 - Autos & Vehicles
1 - Film & Animation
10 - Music
15 - Pets & Animals
17 - Sports
18 - Short Movies
19 - Travel & Events
20 - Gaming
21 - Videoblogging
22 - People & Blogs
@johnbintz
johnbintz / simple-capistrano-docker-deploy.rb
Last active April 3, 2023 08:23
Simple Capistrano deploy for a Docker-managed app
# be sure to comment out the require 'capistrano/deploy' line in your Capfile!
# config valid only for Capistrano 3.1
lock '3.2.1'
set :application, 'my-cool-application'
# the base docker repo reference
set :name, "johns-stuff/#{fetch(:application)}"
@obmarg
obmarg / gae_workaround.py
Last active January 4, 2017 08:27
Google app engine python issue #7746 workaround
from toolz import concat
def page_iterator(query, page_size=999, **kwargs):
'''
Returns an iterator over pages of a query.
Can be used to work-around the 1000 entity limit in remote_api_shell
:params query: The query we're using.
:params page_size: The page size to return.
:params qwargs: Additional options for fetch_page
@victorquinn
victorquinn / promise_while_loop.js
Last active March 30, 2023 04:29
Promise "loop" using the Bluebird library
var Promise = require('bluebird');
var promiseWhile = function(condition, action) {
var resolver = Promise.defer();
var loop = function() {
if (!condition()) return resolver.resolve();
return Promise.cast(action())
.then(loop)
.catch(resolver.reject);
@mattetti
mattetti / gist:3798173
Last active April 16, 2023 03:09
async fetching of urls using goroutines and channels
package main
import (
"fmt"
"net/http"
"time"
)
var urls = []string{
"https://splice.com/",
@dbieber
dbieber / collapse.css
Created September 4, 2012 04:43
Collapsible Blog Posts
/* CSS for Collapsible Blog Posts */
[id^=_] {
display: none;
}