@DeepInEvil
Last active August 29, 2015 14:27
#!/bin/bash
#Script for preprocessing HTML files using bash
#Meant to be faster than doing the same cleanup in Python/R
#Print the start time
date
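#Loop through the files (a plain glob copes with odd file names, unlike parsing ls output)
for file in ./Train1/*.txt
do
#Skip the literal pattern if nothing matched
[ -e "$file" ] || continue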
#Remove HTML tags (only tags that open and close on the same line)
sed -i 's/<[^>]*>//g' "$file"
#Chomp extra lines: join everything into one line, squeezing every run of
#whitespace (blank lines included) into a single space so that words on
#adjacent lines are not glued together
perl -0777 -i -pe 's/\s+/ /g' "$file"
#Remove punctuation (it is deleted, not replaced by a space, so "don't" becomes "dont")
perl -p -i -e 's/\p{Punct}//g' "$file"
#The case should be lower
perl -p -i -e 'tr/A-Z/a-z/' "$file"
#JavaScript leftovers are annoying: blank out common identifiers and runs of '='
#(var is matched as a whole word only, so "various" and "variable" survive)
perl -p -i -e 's/getelementsbytag|function|jquery|\bvar\b|=+/ /g' "$file"
#Stop, STOP! I mean: remove the stop words listed in blacklist.stop
/home/debanjan/TrulyNative/RemoveStop.sh /home/debanjan/TrulyNative/blacklist.stop "$file"
#Hey text, what's up? What do you contain? (the answer goes to $file.freq)
/home/debanjan/TrulyNative/getFrequency.sh "$file" > "$file.freq"
#Average English word length is 5.19 characters, so drop anything of 13 characters or more
perl -i -lne 'length() < 13 && print' "$file.freq"
#Alphanumeric, I don't want you: keep only purely alphabetic entries
perl -i -nle 'print if m{^[a-zA-Z]+$}' "$file.freq"
#Enough is enough: join all entries into a single space-separated line
sed -i -e :a -e 'N;s/\n/ /;ta' "$file.freq"
done
#Sorry texts, I have the frequencies now: delete the originals (only the .freq files remain)
rm ./Train1/*.txt
#What is the time now? Print the end time to see how long it took
date
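
RemoveStop.sh and getFrequency.sh are not included in this gist, so what they really do is unknown. The block below is only a minimal sketch of what they might look like, written as two bash functions. The names `remove_stop` and `get_frequency`, the one-word-per-line stop list, and the output format (unique words, most frequent first, one per line, no counts) are all assumptions, inferred from how the loop filters `$file.freq` afterwards.

```shell
# Hypothetical stand-ins for RemoveStop.sh and getFrequency.sh.
# The real scripts are not shown; arguments and output format are assumptions.

# remove_stop STOPLIST FILE
# Rewrites FILE in place without the words listed (one per line) in STOPLIST.
remove_stop() {
    local stoplist=$1 file=$2 tmp status
    tmp=$(mktemp) || return 1
    # First input: load the stop words. Second input: print each line with
    # the stop words left out, joining the surviving words with single spaces.
    awk 'FILENAME == ARGV[1] { stop[$1]; next }
         {
             out = ""
             for (i = 1; i <= NF; i++)
                 if (!($i in stop))
                     out = out (out == "" ? "" : " ") $i
             print out
         }' "$stoplist" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# get_frequency FILE
# Prints each distinct word of FILE once, most frequent first
# (ties in alphabetical order), one word per line and without the counts.
get_frequency() {
    tr -s '[:space:]' '\n' < "$1" \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2 }'
}
```

Under these assumptions, the two external calls in the loop would correspond to `remove_stop /home/debanjan/TrulyNative/blacklist.stop "$file"` and `get_frequency "$file" > "$file.freq"`.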