Yohei Tamura tamuhey

## prime_regex.py
import re


def prime(n: int) -> bool:
    # For n >= 2: "a" * n matches ^(aa+?)\1+$ only when n = k * m with k, m >= 2 (composite).
    return re.match(r"^(aa+?)\1+$", "a" * n) is None


for i in range(2, 10000):
    if prime(i):
        print(i)

## 049.rs
#![allow(unused_imports)]
#![allow(unused_macros)]
use std::cmp::Reverse as R;
use std::collections::*;
use std::io::{stdin, BufWriter, Read, Write};
use std::mem;

#[allow(unused_macros)]
macro_rules! parse {
    ($it: ident ) => {};

## rust_bug.md

      
Created September 29, 2020 17:13

A simple example that triggers a bug in Rust 1.46

trait Foo {}
fn bug() -> impl Foo<[(); |_: ()| {}]> {}

  
## rust_memory_reorder.rs
// Ref: https://preshing.com/20120515/memory-reordering-caught-in-the-act/
// Let's change `Relaxed` to `SeqCst` and see what changed
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering::*;
use std::thread::spawn;
static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

use rand;
use rand::Rng;

## tokenizations_post.md

      
Last active March 30, 2024 19:00

How to calculate the alignment between BERT and spaCy tokens effectively and robustly


Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve various tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially around tokenization. In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens (e.g. BERT vs. spaCy). I also introduce Python and Rust libraries that implement this algorithm.
Here are the library and the demo site links:

repo: https://github.com/tamuhey/tokenizations
site: https://tamuhey.github.io/tokenizations/
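
As a rough illustration of what computing a correspondence between two tokenizations involves, here is a minimal sketch that aligns tokens through their characters. It is not the algorithm implemented in the tokenizations library: it swaps in Python's standard `difflib.SequenceMatcher` for the character diff, uses lowercasing as its only normalization, and the helper names `char_owners` and `align` are hypothetical.

```python
from difflib import SequenceMatcher


def char_owners(tokens):
    """Return the lowercased concatenation of tokens and, per character, its token index."""
    chars, owners = [], []
    for index, token in enumerate(tokens):
        for ch in token.lower():
            chars.append(ch)
            owners.append(index)
    return "".join(chars), owners


def align(tokens_a, tokens_b):
    """For each token in tokens_a, list the tokens_b indices that share a matched character."""
    text_a, owners_a = char_owners(tokens_a)
    text_b, owners_b = char_owners(tokens_b)
    a2b = [set() for _ in tokens_a]
    # autojunk=False disables the "popular character" heuristic, which could hide matches.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            a2b[owners_a[block.a + offset]].add(owners_b[block.b + offset])
    return [sorted(indices) for indices in a2b]


# Hypothetical inputs: word-level tokens vs. WordPiece-style tokens.
spacy_like = ["playing", "football"]
bert_like = ["play", "##ing", "foot", "##ball"]
print(align(spacy_like, bert_like))  # [[0, 1], [2, 3]]
```

Characters that appear on only one side, such as the `##` prefix, are simply left unmatched, so the sketch needs no tokenizer-specific rules.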


## docker_show_context.sh
# -f - reads the Dockerfile from stdin while "." is still sent as the build context
docker build -t test -f - . <<EOF
FROM busybox

RUN mkdir /tmp/build/
# Add context to /tmp/build/
COPY . /tmp/build/
EOF

docker run --rm -it test find /tmp/build

## test
test

## tokenizations.md

      
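Last active May 12, 2020 14:55

tokenizations

Efficiently computing the correspondence between BERT and SentencePiece tokens

When doing language processing, I think people often use a tokenizer such as mecab to split text into tokens. This article presents a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, along with its implementation (tokenizations).
For example, we will look at a method, independent of any tokenizer implementation, for computing the correspondence between sentencepiece and BERT tokenizations such as the following.
# Tokenizations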
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
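
The BERT tokens above appear to have lost their dakuten (ベ shows up as ヘ, ブ as フ). The sketch below assumes this comes from accent stripping, that is, Unicode NFD decomposition followed by removal of combining marks, and checks whether both tokenizations spell the same string once that normalization and the tokenizer-specific markers are accounted for. The helper name `strip_marks` is hypothetical, and the normalization step is an assumption rather than something stated in this article.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks such as the dakuten."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


bert = ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
spm = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

# Remove tokenizer-specific markers: the "##" continuation prefix
# and the "▁" word-boundary symbol.
bert_text = "".join(tok[2:] if tok.startswith("##") else tok for tok in bert)
spm_text = strip_marks("".join(spm).replace("▁", ""))

# If the assumption holds, both sides now spell the same character sequence,
# which is what makes a character-level alignment possible.
print(bert_text == spm_text)
```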


## tokenizations.md

      
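Last active February 18, 2020 11:12

Computing the correspondence between two tokenizations

When doing language processing, I think people often use a tokenizer such as mecab to split text into tokens. This article presents a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, along with its implementation (tokenizations).
For example, we will look at a general method, independent of any tokenizer implementation, for computing the correspondence between sentencepiece and BERT tokenizations such as the following.
# Tokenizations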
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']


## tokenizations.md

      
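Created February 18, 2020 11:03

https://github.com/tamuhey/tokenizations

Computing the correspondence between two tokenizations

When doing language processing, I think people often use a tokenizer such as mecab to split text into tokens. This article presents a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, along with its implementation (tokenizations).
For example, we will look at a general method, independent of any tokenizer implementation, for computing the correspondence between sentencepiece and BERT tokenizations such as the following.
# Tokenizations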
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']