@ibrahima
Created June 9, 2010 21:33
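#!/bin/bash
# Split the Reuters-21578 dataset into separate files, one per article.
# Run from the directory that holds the reut2-*.sgm files; pieces go to split/.
mkdir -p split
for f in reut2-*.sgm
do
    echo "$f"
    # Drop the first line (the DOCTYPE declaration), then start a new piece at
    # every <REUTERS tag. -k keeps the pieces already written once the
    # oversized repeat count {100000} runs out of matches.
    sed '1d' "$f" | csplit -ks -n 3 -f "split/${f%.sgm}" - '/<REUTERS/' '{100000}'
done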
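
The pipeline above can be tried on a small stand-in file before running it on the real dataset. This is only a sketch: `sample.sgm`, its two toy articles, and the temporary directory are made up for illustration and are not part of Reuters-21578. Because the first `<REUTERS` tag sits on the first line after `sed '1d'`, csplit may write an empty leading piece, so the check counts non-empty pieces only and should report two.

```shell
#!/bin/bash
# Hypothetical sanity check: build a tiny stand-in for a reut2-*.sgm file
# and run the same sed | csplit pipeline on it.
tmp=$(mktemp -d)
mkdir -p "$tmp/split"

cat > "$tmp/sample.sgm" <<'EOF'
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS NEWID="1">
<BODY>first article</BODY>
</REUTERS>
<REUTERS NEWID="2">
<BODY>second article</BODY>
</REUTERS>
EOF

# csplit exits non-zero when the repeat count outlasts the matches;
# -k keeps the pieces anyway, so the status and the message are ignored here.
sed '1d' "$tmp/sample.sgm" \
    | csplit -ks -n 3 -f "$tmp/split/sample" - '/<REUTERS/' '{100000}' 2>/dev/null || true

# Count the pieces that actually contain an article (skip empty ones).
pieces=0
for p in "$tmp"/split/sample*
do
    if [ -s "$p" ]
    then
        pieces=$((pieces + 1))
    fi
done
echo "non-empty pieces: $pieces"

rm -rf "$tmp"
```

If the count matches the number of `<REUTERS` tags in the input, the split pattern is doing what the script intends.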