FastText Language Detection - Training on macOS
#!/bin/bash
# grab labeled dataset
wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
bunzip2 sentences.tar.bz2
tar xvf sentences.tar
# macOS only: coreutils provides gshuf
brew install coreutils
# turn each "id<TAB>lang<TAB>sentence" row into "__label__lang sentence", then shuffle
awk -F"\t" '{print"__label__"$2" "$3}' < sentences.csv | gshuf > all.txt
# first 10000 shuffled lines for validation, the rest for training
head -n 10000 all.txt > valid.txt
tail -n +10001 all.txt > train.txt
# baseline model: accuracy is 96.5%
./fasttext supervised -input train.txt -output langdetect -dim 16
./fasttext test langdetect.bin valid.txt
# with character n-grams of length 2 to 4: accuracy is 98.5%
./fasttext supervised -input train.txt -output langdetect -dim 16 -minn 2 -maxn 4
./fasttext test langdetect.bin valid.txt
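
The data preparation step can be tried on a few rows before downloading the full export. The sketch below is only an illustration: the helper name `to_fasttext` and the three sample rows are made up, and it assumes the same tab-separated id, language, sentence layout that the awk command in the script relies on. It uses a 1-line validation split in place of the 10000-line one.

```shell
# Hypothetical helper (not in the original script): reads rows of the form
# id<TAB>language<TAB>sentence on stdin and writes fastText training lines.
to_fasttext() {
  awk -F"\t" '{print "__label__" $2 " " $3}'
}

# Three made-up rows stand in for sentences.csv.
sample="$(printf '1\teng\tHello world.\n2\tfra\tBonjour le monde.\n3\tdeu\tHallo Welt.\n' | to_fasttext)"

# Same head/tail pattern as the script, with 1 validation line instead of 10000.
printf '%s\n' "$sample" | head -n 1     # validation part
printf '%s\n' "$sample" | tail -n +2    # training part: everything from line 2 on
```

Because `tail -n +2` starts at the line right after the one `head -n 1` returns, the two parts never overlap. The script's `head -n 10000` and `tail -n +10001` pair works the same way.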