Last active
January 28, 2018 09:28
FastText Language Detection - Training on macOS
#!/bin/bash
# download and unpack the labeled Tatoeba sentence dataset
wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
bunzip2 sentences.tar.bz2
tar xvf sentences.tar
# macOS only: coreutils provides gshuf
brew install coreutils
# convert each row to "__label__<lang> <sentence>" and shuffle
awk -F"\t" '{print "__label__"$2" "$3}' < sentences.csv | gshuf > all.txt
# first 10000 lines for validation, the rest for training
head -n 10000 all.txt > valid.txt
tail -n +10001 all.txt > train.txt
# accuracy is 96.5%
./fasttext supervised -input train.txt -output langdetect -dim 16
./fasttext test langdetect.bin valid.txt
# accuracy is 98.5% (with character n-grams of length 2 to 4)
./fasttext supervised -input train.txt -output langdetect -dim 16 -minn 2 -maxn 4
./fasttext test langdetect.bin valid.txt
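
The data preparation step can be checked on a small scale before running it on the full export. The following is a minimal sketch, assuming the Tatoeba file has tab-separated columns in the order id, language code, sentence (which is what the awk command above implies). The three sample rows and the file names sample.csv, sample.txt, sample_valid.txt and sample_train.txt are made up for illustration, and the shuffle is left out so the result is reproducible.

```shell
# Hypothetical three-row sample in the assumed layout: id <TAB> lang <TAB> sentence
printf '1\teng\tHello world\n2\tfra\tBonjour le monde\n3\tdeu\tHallo Welt\n' > sample.csv

# Same awk transform as in the script: prefix the language code with __label__
awk -F"\t" '{print "__label__"$2" "$3}' < sample.csv > sample.txt

# Same head/tail split as in the script, scaled down to 1 validation line
head -n 1 sample.txt > sample_valid.txt
tail -n +2 sample.txt > sample_train.txt

cat sample_valid.txt   # __label__eng Hello world
cat sample_train.txt   # the remaining two labeled lines
```

Note that `tail -n +2` starts at line 2, just as `tail -n +10001` starts at line 10001 in the script, so the validation and training files do not overlap.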