Balancing dataset based on labels
Clean Arabic dataset
Limit number of words per sentence
Shuffle, clean and deduplicate
Get most frequent words
Loop through files and edit their extension or process them
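Limit number of words per sentence (keep sentences of 5 to 10 words; -E is required for the grouping and {4,9} syntax, and a UTF-8 locale is needed for \w to match Arabic letters):
grep -E '^(\w+\s){4,9}\w+$' ar_corpus_text.txt > short_sentences.txt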
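Clean Arabic dataset (keep lines that contain Arabic letters and drop any with Latin letters or digits):
grep '[ء-ي]' short_sentences.txt | grep -v '[a-zA-Z0-9]' > clean_sentences.txt
Sort and remove duplicate sentences:
sort -u clean_sentences.txt > uniq_clean_sentences.txt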
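The outline lists "Get most frequent words", but no command for it appears among the one-liners. One possible sketch, assuming tokens are separated by whitespace; the helper name `top_words` and the default of 20 results are illustrative choices, not part of the original notes:

```shell
# top_words FILE [N]: print the N most frequent whitespace-separated tokens
# in FILE, most frequent first (hypothetical helper, N defaults to 20).
top_words() {
    file=$1
    n=${2:-20}
    # Put one token per line, drop empty lines, count identical tokens,
    # then sort by count in descending order.
    tr -s '[:space:]' '\n' < "$file" \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "$n"
}
```

For example, `top_words uniq_clean_sentences.txt 50` would list the 50 most common tokens with their counts.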
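Split corpus.csv into one text file per label (column 1 holds the label, column 5 the text; the header row is skipped and each label file is truncated on first use, then appended to):
awk -F ',' 'NR==1{next} {f=$1".txt"; if(!seen[$1]++) printf "" > f; print $5 >> f; close(f)}' corpus.csv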
for f in *.txt; do echo "Processing $f"; cat "$f" | sed 's/\./.\n/g' | sed 's/؟/؟\n/g' | grep -E '^(\w+\s){4,9}(\w+)(\.|؟)
for f in *.woq; do shuf -n 600
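As a rough sketch of how per-label files could be downsampled to a common size: the cap of 600 and the `.woq` extension come from the command above, while the function name `balance_labels` and the tab-separated `label<TAB>sentence` output are assumptions made for illustration, not the original pipeline.

```shell
# balance_labels N EXT: for every ./*.EXT file, emit at most N randomly
# chosen lines, each prefixed with the file's base name as its label.
# (Hypothetical helper; relies on GNU shuf.)
balance_labels() {
    n=$1
    ext=$2
    for f in ./*."$ext"; do
        [ -e "$f" ] || continue            # nothing matched the pattern
        label=$(basename "$f" ."$ext")
        # shuf -n keeps at most n lines; shorter files are kept whole.
        shuf -n "$n" "$f" | awk -v label="$label" '{ print label "\t" $0 }'
    done
}
```

Running `balance_labels 600 woq > balanced.tsv` in the folder holding the per-label files would then give every label at most 600 sentences. Labels with fewer lines stay under-represented, so the cap should not exceed the size of the smallest class if strict balance is needed.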
Convert an Arabic file coming from Windows (Windows-1256 encoding) to UTF-8:
iconv -f windows-1256 -t utf8 original_file.txt > unix_file.txt
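Files from Windows often carry CRLF line endings as well, and some may already be UTF-8. A minimal sketch that handles both cases, assuming any file that is not valid UTF-8 is Windows-1256 (the helper name `to_utf8` is hypothetical):

```shell
# to_utf8 SRC DST: write SRC to DST as UTF-8 with Unix line endings.
# Assumption: input that fails UTF-8 validation is Windows-1256.
to_utf8() {
    src=$1
    dst=$2
    if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
        # Already valid UTF-8: only strip carriage returns.
        tr -d '\r' < "$src" > "$dst"
    else
        iconv -f WINDOWS-1256 -t UTF-8 "$src" | tr -d '\r' > "$dst"
    fi
}
```

Usage: `to_utf8 original_file.txt unix_file.txt`. Pure-ASCII files pass the UTF-8 check and are copied with only the line endings changed.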
Find the difference between the outputs of two commands (example: check that every audio file has a matching text file):
diff <(ls *.wav | sed 's/\.wav$//i') <(ls *.txt | sed 's/\.txt$//i')
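When the diff output is hard to read, the same check can be split by direction with `comm`. This is a sketch under the assumption that audio and text files share a base name and live in one folder; the function name `unpaired` and the message wording are made up for illustration:

```shell
# unpaired [DIR]: report .wav files without a .txt partner and vice versa.
# (Hypothetical helper; DIR defaults to the current folder.)
unpaired() {
    dir=${1:-.}
    tmp=$(mktemp -d) || return 1
    # Collect the base names for each extension, sorted as comm expects.
    for ext in wav txt; do
        for f in "$dir"/*."$ext"; do
            [ -e "$f" ] && basename "$f" ."$ext"
        done | sort > "$tmp/$ext"
    done
    # comm -23: names only in the first list; comm -13: only in the second.
    comm -23 "$tmp/wav" "$tmp/txt" | sed 's/^/missing .txt for: /'
    comm -13 "$tmp/wav" "$tmp/txt" | sed 's/^/missing .wav for: /'
    rm -rf "$tmp"
}
```

An empty output from `unpaired some_folder` means every audio file has a text file and the reverse.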
Find the total duration of all WAV files in a folder (soxi -D prints each duration in seconds):
soxi -D *.wav | awk '{print; total += $1} END {print "total duration (s):", total}'
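For large corpora a total in hours and minutes is easier to read than raw seconds. A small sketch that reads one duration in seconds per line, as `soxi -D` prints them; the helper name `sum_durations` and the output layout are assumptions:

```shell
# sum_durations: read one duration (seconds) per line on stdin and print
# the file count plus the total as HH:MM:SS.ss. (Hypothetical helper.)
sum_durations() {
    awk '{ total += $1 }
         END {
             h = int(total / 3600)
             m = int((total % 3600) / 60)
             s = total - h * 3600 - m * 60
             printf "%d files, %02d:%02d:%05.2f (%.2f s)\n", NR, h, m, s, total
         }'
}
# Usage (assumes sox is installed): soxi -D *.wav | sum_durations
```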