@wael34218
Last active December 16, 2021 16:22

Bash in Data Preparation

Balancing dataset based on labels

Clean Arabic dataset

Limit number of words per sentence

Shuffle, clean, and deduplicate

Get most frequent words

Loop through files and edit their extension or process them

grep -E '^(\w+\s){4,9}\w+$' ar_corpus_text.txt > short_sentences.txt
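The regex above keeps lines of 5 to 10 words (four to nine word-plus-space groups followed by a final word). A rough alternative sketch, assuming words are separated by whitespace and that punctuation inside a line is acceptable, counts fields with awk instead. The function name `keep_by_word_count` and its default bounds are hypothetical, not part of the original commands.

```shell
# keep_by_word_count FILE [MIN] [MAX]
# Print lines whose whitespace-separated word count is between MIN and MAX
# (defaults 5 and 10, matching the {4,9} + 1 bounds of the grep pattern).
# Unlike the grep pattern, this does not reject lines containing punctuation.
keep_by_word_count() {
  awk -v min="${2:-5}" -v max="${3:-10}" 'NF >= min && NF <= max' "$1"
}
```

Usage: `keep_by_word_count ar_corpus_text.txt > short_sentences.txt`. Counting fields avoids depending on how `\w` treats Arabic letters in the current locale.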

grep '[ء-ي]' short_sentences.txt | grep -v '[a-zA-Z0-9]' > clean_sentences.txt

sort -u clean_sentences.txt > uniq_clean_sentences.txt
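The list at the top mentions getting the most frequent words, but no command for it appears here. A minimal sketch, assuming one sentence per line and whitespace-separated words; the function name `top_words` and the default of 10 are hypothetical.

```shell
# top_words FILE [N]
# Print the N most frequent whitespace-separated words in FILE (default 10),
# each preceded by its count.
top_words() {
  tr -s '[:space:]' '\n' < "$1" |  # put one word on each line
    grep -v '^$' |                 # drop any empty line left at the start
    sort | uniq -c |               # count identical words
    sort -rn |                     # highest count first
    head -n "${2:-10}"
}
```

Usage: `top_words uniq_clean_sentences.txt 20`.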

awk -F ',' 'NR==1{next} {f=$1".txt"} !seen[$1]++{print $5 > f; close(f); next} {print $5 >> f; close(f)}' corpus.csv

for f in *.txt; do echo "Processing $f"; sed 's/\./.\n/g; s/؟/؟\n/g' "$f" | grep -E '^(\w+\s){4,9}\w+(\.|؟)$' | grep '[ء-ي]' | grep -v '[a-zA-Z]' | sort -u > "$(basename "$f" .txt).clean"; done

for f in *.woq; do shuf -n 600 "$f" > "$(basename "$f" .woq).rdy"; done; wc -l *.rdy; cat *.rdy | sort -u | wc -l
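Before sampling a fixed number of lines per label, it can help to see how many rows each label actually has, since `shuf -n 600` returns fewer lines when a file is shorter than 600. A small sketch, assuming (as the awk command above does) a header row and the label in the first comma-separated column; the function name `label_counts` is hypothetical.

```shell
# label_counts FILE
# Print the number of rows per label, largest first.
# Assumes a header line and the label in column 1; quoted fields that
# contain commas are not handled.
label_counts() {
  tail -n +2 "$1" | cut -d ',' -f 1 | sort | uniq -c | sort -rn
}
```

Usage: `label_counts corpus.csv`. The smallest count is a natural upper bound for the per-label sample size.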

Convert an Arabic file from Windows (windows-1256 encoding) to UTF-8:

iconv -f windows-1256 -t utf8 original_file.txt > unix_file.txt
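To check whether a file is already valid UTF-8 before or after converting, one option is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is a sketch that assumes an iconv implementation which fails on invalid input (as GNU iconv does); the function name `is_utf8` is hypothetical.

```shell
# is_utf8 FILE
# Succeeds (exit status 0) if FILE decodes cleanly as UTF-8, fails otherwise.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

Usage: `is_utf8 unix_file.txt && echo "valid UTF-8"`.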

Diff the output of two commands (for example, check that every audio file has a corresponding text file):

diff <(ls *.wav | sed 's/\.wav$//i') <(ls *.txt | sed 's/\.txt$//i')
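The diff output shows that the two listings differ, but it can be terse to read. A sketch of a more explicit check, assuming lowercase `.wav` and `.txt` extensions and that both kinds of file sit in the same directory; the function name `unpaired` and the message wording are hypothetical.

```shell
# unpaired DIR
# Report every .wav in DIR without a matching .txt, and every .txt
# without a matching .wav. Prints nothing when all files are paired.
unpaired() {
  for f in "$1"/*.wav; do
    [ -e "$f" ] || continue                       # no .wav files at all
    [ -e "${f%.wav}.txt" ] || echo "missing text: ${f%.wav}.txt"
  done
  for f in "$1"/*.txt; do
    [ -e "$f" ] || continue                       # no .txt files at all
    [ -e "${f%.txt}.wav" ] || echo "missing audio: ${f%.txt}.wav"
  done
}
```

Usage: `unpaired .` from inside the data folder.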

Find the total duration (in seconds) of all wav files in a folder:

soxi -D *.wav | awk '{print; total += $1}; END {print "total duration (s): ", total}'
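For a large corpus, a total in seconds is hard to read. A small sketch that formats the sum as hours, minutes, and seconds, assuming the input is one duration in seconds per line (which is what `soxi -D` prints); the function name `sum_durations` is hypothetical.

```shell
# sum_durations
# Read one duration in seconds per line on stdin and print the total,
# rounded to the nearest second, as H:MM:SS.
sum_durations() {
  awk '{ total += $1 }
       END {
         t = int(total + 0.5)
         printf "%d:%02d:%02d\n", t / 3600, (t % 3600) / 60, t % 60
       }'
}
```

Usage (requires sox): `soxi -D *.wav | sum_durations`.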