Yohei Tamura (tamuhey)

@tamuhey
tamuhey / tokenizations_post.md
Last active March 30, 2024 19:00
How to calculate the alignment between BERT and spaCy tokens effectively and robustly



site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve many tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially for tokenization. In this article, I describe an algorithm that simplifies one such step: calculating the correspondence between two different tokenizations of the same text (e.g. BERT vs. spaCy tokens). I also introduce Python and Rust libraries that implement this algorithm; the demo site is linked above.
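
A minimal usage sketch of the Python library (this assumes the pytokenizations package and its get_alignments function, names taken from the linked project; the printed outputs are illustrative, not verified):

# Minimal sketch, assuming `pip install pytokenizations` (import name: tokenizations).
import tokenizations

spacy_tokens = ["New", "York", "is", "large", "."]
bert_tokens = ["new", "york", "is", "la", "##rge", "."]

# a2b[i] lists the indices of bert_tokens that overlap spacy_tokens[i]; b2a is the inverse.
a2b, b2a = tokenizations.get_alignments(spacy_tokens, bert_tokens)
print(a2b)  # expected along the lines of [[0], [1], [2], [3, 4], [5]]
print(b2a)  # expected along the lines of [[0], [1], [2], [3], [3], [4]]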

@tamuhey
tamuhey / prompt.ps1
Created April 9, 2019 02:52
powershell prompt with git branch name and conda env name
# https://stackoverflow.com/a/44411205/10051099
function Write-BranchName () {
    try {
        $branch = git rev-parse --abbrev-ref HEAD
        if ($branch -eq "HEAD") {
            # we're probably in detached HEAD state, so print the SHA
            $branch = git rev-parse --short HEAD
            Write-Host " ($branch)" -ForegroundColor "red"
        } else {
            # we're on an actual branch, so print its name
            Write-Host " ($branch)" -ForegroundColor "blue"
        }
    } catch {
        # git rev-parse failed, e.g. a repo with no commits yet
        Write-Host " (no branches yet)" -ForegroundColor "yellow"
    }
}
@tamuhey
tamuhey / file0.txt
Last active May 12, 2022 16:30
Writing a simple first-principles calculation (density functional theory) in Python. ref: https://qiita.com/tamurahey/items/9ac3ca91923d2834c7e0
import numpy as np

# Discretize the 1D domain [-5, 5] on a uniform grid of 200 points.
n_grid = 200
x = np.linspace(-5, 5, n_grid)
h = x[1] - x[0]

# Forward-difference first-derivative operator: (f[i+1] - f[i]) / h.
D = -np.eye(n_grid) + np.diagflat(np.ones(n_grid-1), 1)
D /= h
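
Continuing from the derivative operator above, here is a minimal, self-contained sketch of the step the article builds toward: assembling a Hamiltonian from a finite-difference Laplacian and diagonalizing it. The harmonic potential V(x) = x^2/2 is a standard test case chosen here for illustration, not necessarily the article's exact code.

# Sketch: symmetric Laplacian from the forward-difference D, then a 1D Schroedinger solve.
import numpy as np

n_grid = 200
x = np.linspace(-5, 5, n_grid)
h = x[1] - x[0]
D = (-np.eye(n_grid) + np.diagflat(np.ones(n_grid - 1), 1)) / h

D2 = D @ (-D.T)                 # second-derivative (Laplacian) operator
T = -0.5 * D2                   # kinetic energy in atomic units
V = np.diagflat(0.5 * x ** 2)   # harmonic potential x^2 / 2 (illustrative choice)

# The lowest eigenvalues should approach 0.5, 1.5, 2.5, ... for the harmonic oscillator.
eigenvalues, eigenvectors = np.linalg.eigh(T + V)
print(eigenvalues[:3])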
// Competitive-programming style preamble: common imports and a parse! helper macro
// (only the base arm is shown in this preview).
#![allow(unused_imports)]
#![allow(unused_macros)]
use std::cmp::Reverse as R;
use std::collections::*;
use std::io::{stdin, BufWriter, Read, Write};
use std::mem;

#[allow(unused_macros)]
macro_rules! parse {
    ($it: ident) => {};
}
@tamuhey
tamuhey / prime_regex.py
Created June 14, 2021 03:11
Check if a number is prime with regex
import re

def prime(n: int):
    # "a" * n matches (aa+?)\1+ only when n can be split into two or more
    # equal chunks of length >= 2, i.e. when n is composite; no match => prime.
    return re.match(r"^(aa+?)\1+$", "a" * n) is None

for i in range(2, 10000):
    if prime(i):
        print(i)
@tamuhey
tamuhey / rust_bug.md
Created September 29, 2020 17:13
A simple example that triggers a bug in Rust 1.46
trait Foo {}
fn bug() -> impl Foo<[(); |_: ()| {}]> {}

playground

// Ref: https://preshing.com/20120515/memory-reordering-caught-in-the-act/
// Let's change `Relaxed` to `SeqCst` and see what changes
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering::*;
use std::thread::spawn;
static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);
use rand;
use rand::Rng;
@tamuhey
tamuhey / test
Created May 13, 2020 16:44
nfkd_vs_nfkc
test
@tamuhey
tamuhey / docker_show_context.sh
Created May 18, 2020 10:41
docker_list_context.sh
# Build a throwaway image, reading the Dockerfile from stdin (-f -) while using . as context.
docker build -t test -f - . <<EOF
FROM busybox
RUN mkdir /tmp/build/
# Add context to /tmp/build/
COPY . /tmp/build/
EOF

# List everything that ended up in the build context.
docker run --rm -it test find /tmp/build
@tamuhey
tamuhey / tokenizations.md
Last active May 12, 2020 14:55
tokenizations

Efficiently computing the correspondence between BERT and Sentencepiece tokens

When doing NLP, we usually segment text with a tokenizer such as MeCab. This article introduces a method, and its implementation (tokenizations), for computing the correspondence between the outputs (segmentations) of different tokenizers. For example, we will look at a tokenizer-agnostic way to compute the correspondence between sentencepiece and BERT segmentations like the ones below.

# Tokenized output
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
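
As a minimal sketch of how this correspondence can be computed in Python (assuming the pytokenizations package and its get_alignments function, as in the English article above; treat the names as assumptions taken from that project):

# Sketch only: assumes `pip install pytokenizations` (import name: tokenizations).
import tokenizations

bert = ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
sp = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

# a2b[i] lists the indices of `sp` tokens overlapping bert[i]; b2a is the inverse mapping.
a2b, b2a = tokenizations.get_alignments(bert, sp)
print(a2b)
print(b2a)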