@illy
Created February 22, 2012 22:07

#Data Preparation Scripts for the University of Birmingham PhD Dissertation Corpus Project

Sheng Li, University of Birmingham

This file records the scripts used in the UoB PhD corpus project, in the hope that they will be a useful starting point for future projects.

All scripts here are basic shell scripts built from standard command-line tools (mostly GNU utilities) together with pdftotext, which ships with xpdf.

Note: to run pdftotext on your own machine, you need to install xpdf first. On OS X, you can install it with [homebrew](https://github.com/mxcl/homebrew), [macports](http://www.macports.org/) or [gentoo prefix](http://www.gentoo.org/proj/en/gentoo-alt/prefix/bootstrap-macos.xml), or compile and install the package manually. On Debian, you can simply type

sudo apt-get install xpdf

The commands used here include:

  1. wget: The non-interactive network downloader
  2. find: search for files in a directory hierarchy
  3. rm: remove files or directories
  4. xargs: build and execute command lines from standard input
  5. cp: copy files and directories
  6. mv: move (rename) files
  7. cat: concatenate files and print on the standard output
  8. ls: list directory contents
  9. echo: display a line of text
  10. sed: stream editor for filtering and transforming text
  11. uniq: report or omit repeated lines
  12. sort: sort lines of text files
  13. pdftotext: Portable Document Format (PDF) to text converter
  14. grep: print lines matching a pattern
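
Before running the scripts, it can help to confirm that every tool in the list above is actually installed. The following is a minimal sketch, not part of the original project; `missing_tools` is a hypothetical helper name, and it relies only on the POSIX `command -v` builtin.

```shell
# missing_tools TOOL...
# Hypothetical helper: prints the name of every tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (prints nothing when everything is installed):
#   missing_tools wget find rm xargs cp mv cat ls echo sed uniq sort pdftotext grep
```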

##1. Download all PhD dissertations from the eThesis repository##

wget -r -l2 -A pdf http://etheses.bham.ac.uk/view/awards/d=5Fph.html

wget stores the downloaded files in a nested folder hierarchy that mirrors the URL paths on the server.
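
To check how much was actually downloaded, the PDFs in the hierarchy can be counted. This is a minimal sketch; `count_pdfs` is a hypothetical helper, and the directory name in the usage line assumes wget's default behaviour of creating a top-level folder named after the host.

```shell
# count_pdfs DIR
# Hypothetical helper: prints the number of regular files ending in .pdf
# anywhere below DIR.
count_pdfs() {
    # tr strips the padding that some wc implementations add
    find "$1" -type f -name '*.pdf' | wc -l | tr -d ' '
}

# Usage (assumed default wget output folder):
#   count_pdfs etheses.bham.ac.uk
```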

##2. Extract all files from the subfolder structure##

  1. Get rid of all duplicated-name files

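     find SOURCE_FOLDER -mindepth 2 -maxdepth 2 -name "*ThumbnailVersion" | wc -l
     #count the duplicated-name entries before deleting anything.
    
     find SOURCE_FOLDER -mindepth 2 -maxdepth 2 -name "*ThumbnailVersion" -exec rm -rf {} \;
     #delete them. Without the {} placeholder rm receives no path, so nothing is removed.
     #-mindepth 2 -maxdepth 2 replaces the BSD-only "-depth 2" and works with both GNU and BSD find.
     #Alternatively, use OS X Spotlight to find all duplicated-name files and delete them.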
    
  2. Extract the remaining PDF files

     find SOURCE_FOLDER -name "*.pdf" -exec cp {} TARGET_FOLDER \;
     #{} stands for each file found. mv can be used as well, but cp keeps the downloaded originals intact.
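
Copying every PDF into one flat folder silently overwrites files that share a name. A quick check for such collisions before copying could look like the sketch below; `duplicate_pdf_names` is a hypothetical helper and not part of the original workflow.

```shell
# duplicate_pdf_names DIR
# Hypothetical helper: prints each PDF file name that occurs more than once
# below DIR, i.e. the names that would collide in a flat target folder.
duplicate_pdf_names() {
    find "$1" -type f -name '*.pdf' -exec basename {} \; | sort | uniq -d
}

# Usage:
#   duplicate_pdf_names SOURCE_FOLDER
```

If the command prints nothing, the flat copy is safe with respect to file names.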
    

##3. Convert all PDFs to plain text##

This script is for the uobphd project; it converts all PDF files to plain text.

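	cd /home/corpususer/uob_phd/uob_clean/
	# The stray "cat $i" before the loop is dropped: $i is still unset there.
	for i in *.pdf
	do
		# pdftotext takes the output file as its second argument; it does not write to stdout by default.
		# -htmlmeta wraps the text in a simple HTML page with the PDF metadata; drop it for pure plain text.
		pdftotext -nopgbrk -htmlmeta "$i" "/home/corpususer/uob_phd/uob_clean_txt/$i.txt"
	done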
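
In a large batch, some PDFs may fail to convert (for example encrypted or damaged files). The following sketch wraps the loop in a function that reports such failures instead of passing over them silently. `convert_all` is a hypothetical helper; it keeps the `NAME.pdf.txt` naming used above and leaves out `-htmlmeta`, which is an assumption about the desired output.

```shell
# convert_all SRC_DIR DST_DIR
# Hypothetical helper: converts every PDF in SRC_DIR to DST_DIR/NAME.pdf.txt,
# logs failures to stderr and prints the number of failed files to stdout.
convert_all() {
    src=$1
    dst=$2
    failed=0
    mkdir -p "$dst" || return 1
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue        # no PDFs at all: the glob stays literal
        name=$(basename "$pdf")
        if ! pdftotext -nopgbrk "$pdf" "$dst/$name.txt"; then
            echo "conversion failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed"
}

# Usage:
#   convert_all /home/corpususer/uob_phd/uob_clean /home/corpususer/uob_phd/uob_clean_txt
```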

##4. Compare the export results##

for f in *
    do echo "$f"
done > file.txt #extract all file names.

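Remove the file extensions (two alternatives):

1. sed 's/\..\{3\}$//' file.txt    #strips a trailing dot followed by exactly three characters, such as .pdf or .txt

2. sed 's/\(.*\)\..*/\1/' file.txt    #strips everything from the last dot onward, whatever the extension length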
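
The two expressions behave differently when an extension is not three characters long. The sketch below wraps each one in a small filter so the difference can be tried on sample names; `strip_ext3` and `strip_last_ext` are hypothetical names used only for illustration.

```shell
# Both helpers read file names on stdin and write the stripped names to stdout.

# strip_ext3: removes a trailing dot plus exactly three characters.
strip_ext3() {
    sed 's/\..\{3\}$//'
}

# strip_last_ext: removes everything from the last dot onward.
strip_last_ext() {
    sed 's/\(.*\)\..*/\1/'
}

# Usage:
#   echo thesis.pdf | strip_ext3       # prints "thesis"
#   echo notes.html | strip_ext3       # prints "notes.html" (extension has four characters)
#   echo notes.html | strip_last_ext   # prints "notes"
```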

Compare (grep takes the pattern first, then the files):

grep KEYWORD FILES | sort | uniq
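
One way to compare the two folders directly is to list the PDFs that have no matching text file. The sketch below does this with `comm`; `missing_txt` is a hypothetical helper, and it assumes the text files are named `NAME.pdf.txt` as produced in step 3.

```shell
# missing_txt PDF_DIR TXT_DIR
# Hypothetical helper: prints the name of every PDF in PDF_DIR for which
# TXT_DIR contains no NAME.pdf.txt file.
missing_txt() {
    pdf_list=$(mktemp) || return 1
    txt_list=$(mktemp) || return 1
    # PDF names as they are, e.g. a.pdf
    find "$1" -maxdepth 1 -name '*.pdf' -exec basename {} \; | sort > "$pdf_list"
    # Text names with the .txt suffix removed, e.g. a.pdf.txt -> a.pdf
    find "$2" -maxdepth 1 -name '*.txt' -exec basename {} .txt \; | sort > "$txt_list"
    # comm -23 keeps the lines that appear only in the first list
    comm -23 "$pdf_list" "$txt_list"
    rm -f "$pdf_list" "$txt_list"
}

# Usage:
#   missing_txt /home/corpususer/uob_phd/uob_clean /home/corpususer/uob_phd/uob_clean_txt
```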
illy commented Apr 18, 2012

This script is for the uobphd project; it converts all PDF files to plain text.

cd /home/corpususer/uob_phd/uob_clean/
for i in *.pdf
do pdftotext -nopgbrk "$i" "/home/corpususer/uob_phd/uob_clean_txt/$i.txt"
done
