@bobye
Created December 17, 2015 19:24
create_ohsumed.sh
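#!/bin/sh
# Build ohsumed_clusters.txt from the OHSUMED corpus: a "%%%<category>" header
# for each category directory, followed by the tocorpus.pl output for each abstract.
# Uncomment the next two lines to download and unpack the corpus first.
#wget http://disi.unitn.it/moschitti/corpora/ohsumed-all-docs.tar.gz
#tar xvf ohsumed-all-docs.tar.gz
cd ohsumed-all || exit 1 # stop if the corpus directory is missing
filename=ohsumed_clusters.txt
rm -f "$filename"
for category in C*; do
    echo "%%%$category" >> "$filename"
    for abstract in "$category"/*; do
        # join the abstract onto one line, then pass it through ../tocorpus.pl
        tr '\n' ' ' < "$abstract" | ../tocorpus.pl >> "$filename"
    done
done
mv "$filename" tmp.txt
awk '!seen[$0]++' tmp.txt > "$filename" # remove duplicate lines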
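
The last step relies on the awk idiom '!seen[$0]++', which prints a line only the first time it appears and keeps the input order. A minimal sketch of that behaviour follows; the helper name dedup_lines and the sample lines are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): drop repeated lines,
# keeping the first occurrence of each and preserving input order.
dedup_lines() {
    awk '!seen[$0]++' "$@"
}

# Made-up sample: the same abstract line appears under two category headers.
printf '%s\n' '%%%C01' 'heart disease' 'heart disease' '%%%C02' 'heart disease' | dedup_lines
```

Because the filter looks at the whole file, a line repeated under a later category header is dropped there as well, so in this sample the text survives only under %%%C01. If duplicates should be removed only within each category, the seen array would have to be reset at every %%% header.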