FastText Language Detection - Training on macOS
#!/bin/bash
# grab labeled dataset
wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
bunzip2 sentences.tar.bz2
tar xvf sentences.tar
# macOS only: coreutils provides gshuf (on Linux, use shuf instead)
brew install coreutils
awk -F"\t" '{print"__label__"$2" "$3}' < sentences.csv | gshuf > all.txt
head -n 10000 all.txt > valid.txt
tail -n +10001 all.txt > train.txt
# baseline (word features only): accuracy on valid.txt is 96.5%
./fasttext supervised -input train.txt -output langdetect -dim 16
./fasttext test langdetect.bin valid.txt
# with character n-grams of length 2 to 4 (-minn 2 -maxn 4): accuracy on valid.txt is 98.5%
./fasttext supervised -input train.txt -output langdetect -dim 16 -minn 2 -maxn 4
./fasttext test langdetect.bin valid.txt
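
The data preparation steps above can be sanity-checked on a tiny input before downloading the full export. The sketch below is an illustration only: it assumes the id, language, text tab-separated layout that the awk command implies, uses three made-up sentences, and skips the shuffle so the result is deterministic. The `sample*` file names are hypothetical and are not part of the original script.

```shell
# Hypothetical three-line sample in the id<TAB>lang<TAB>text layout.
printf '1\teng\tHello world\n2\tfra\tBonjour le monde\n3\tdeu\tHallo Welt\n' > sample.csv

# Same transform as the script: prefix the language code with __label__.
# No gshuf here, so the line order is preserved.
awk -F"\t" '{print "__label__" $2 " " $3}' < sample.csv > sample_all.txt

# Same head/tail split, holding out 1 line for validation instead of 10000.
head -n 1 sample_all.txt > sample_valid.txt
tail -n +2 sample_all.txt > sample_train.txt

cat sample_valid.txt   # prints: __label__eng Hello world
cat sample_train.txt
```

Because `tail -n +2` starts at the line right after the one `head -n 1` took, the two files cover every line exactly once. The same reasoning applies to `head -n 10000` and `tail -n +10001` in the script.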