#Data Preparation Scripts for the University of Birmingham PhD Dissertation Corpus Project
This file records the scripts used in the UoB PhD corpus project, in the hope that they provide some pointers for future projects.

All scripts here are basic shell scripts built from standard command-line tools (mostly GNU utilities) together with pdftotext.

Note that to run pdftotext on your own machine, you need to install xpdf first. On OS X, you can use [homebrew](https://github.com/mxcl/homebrew), [macports](http://www.macports.org/) or [gentoo prefix](http://www.gentoo.org/proj/en/gentoo-alt/prefix/bootstrap-macos.xml) to install it automatically, or compile and install the package manually. On Debian, you can simply type

    sudo apt-get install xpdf

The commands used here include:
- wget: the non-interactive network downloader
- find: search for files in a directory hierarchy
- rm: remove files or directories
- xargs: build and execute command lines from standard input
- cp: copy files and directories
- mv: move (rename) files
- cat: concatenate files and print on the standard output
- ls: list directory contents
- echo: display a line of text
- sed: stream editor for filtering and transforming text
- uniq: report or omit repeated lines
- sort: sort lines of text files
- pdftotext: Portable Document Format (PDF) to text converter
- grep: print lines matching a pattern
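
Before running the steps below, it can help to confirm that the listed tools are actually installed. The following is a minimal sketch, not part of the original project scripts; the function name `missing_tools` is hypothetical, and the tool list simply mirrors a subset of the commands named above.

```shell
#!/bin/sh
# Print the name of every tool (given as arguments) that is not found on PATH.
# missing_tools is a hypothetical helper, not part of the original scripts.
missing_tools() {
    for tool in "$@"; do
        # command -v fails when the tool cannot be found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools wget find xargs sed sort uniq grep pdftotext)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools are available."
fi
```

If pdftotext is reported as missing, install it as described above before moving on to step 3.
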
##1. Download all PhD dissertations from the eThesis repository##

    wget -r -l2 -A pdf http://etheses.bham.ac.uk/view/awards/d=5Fph.html

The downloaded files are stored in a hierarchical folder structure.

##2. Extract all files from the subfolder structure##
- Get rid of all files with duplicated names

        find SOURCE_FOLDER -depth 2 -name "*ThumbnailVersion" -exec rm -rf {} \;  # {} passes each match to rm; without it rm gets no path and removes nothing, which may be why the folders seemed impossible to remove
        find SOURCE_FOLDER -depth 2 -name "*ThumbnailVersion" | wc -l  # count the duplicated-name entries
        # Alternatively, use OS X Spotlight to find all duplicated-name files and delete them.
        # Note: -depth 2 is BSD/OS X find syntax; GNU find needs -mindepth 2 -maxdepth 2 instead.

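
The thumbnail clean-up can also be written so that it runs with both GNU and BSD find. This is a sketch under assumptions: the function names are hypothetical, and the demo folder layout (an item folder containing `1` and `1.hassmallThumbnailVersion`) is only a guess at what the download produces.

```shell
#!/bin/sh
# Count entries named *ThumbnailVersion exactly two levels below the given folder.
count_thumbnails() {
    find "$1" -mindepth 2 -maxdepth 2 -name '*ThumbnailVersion' | wc -l | tr -d ' '
}

# Remove those entries; {} + hands all matches to a single rm call.
remove_thumbnails() {
    find "$1" -mindepth 2 -maxdepth 2 -name '*ThumbnailVersion' -exec rm -rf {} +
}

# Demo on a throwaway tree (hypothetical layout, not the real download).
demo=$(mktemp -d)
mkdir -p "$demo/1001/1" "$demo/1001/1.hassmallThumbnailVersion"
touch "$demo/1001/1/thesis.pdf" "$demo/1001/1.hassmallThumbnailVersion/thesis.pdf"
echo "before: $(count_thumbnails "$demo")"
remove_thumbnails "$demo"
echo "after: $(count_thumbnails "$demo")"
```

Counting first and deleting second makes it easy to check that the pattern matches what you expect before anything is removed.
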
- Extract the remaining PDF files

        find SOURCE_FOLDER -name "*.pdf" -exec cp {} TARGET_FOLDER \;  # mv can be used as well, but cp is safer because it leaves the downloaded originals in place

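
The extraction step can be wrapped in a small function and tried on a test tree first. This is a minimal sketch; `flatten_pdfs` and the demo paths are hypothetical names, and the target folder is assumed to lie outside the source folder.

```shell
#!/bin/sh
# Copy every PDF found anywhere below src into the single folder dst.
# Files that share a name overwrite each other, so remove duplicates first.
flatten_pdfs() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    find "$src" -type f -name '*.pdf' -exec cp {} "$dst" \;
}

# Demo on a throwaway tree (hypothetical layout).
flat_src=$(mktemp -d)
flat_dst=$(mktemp -d)
mkdir -p "$flat_src/1001/1" "$flat_src/1002/1"
touch "$flat_src/1001/1/a.pdf" "$flat_src/1002/1/b.pdf" "$flat_src/1002/1/index.html"
flatten_pdfs "$flat_src" "$flat_dst"
ls "$flat_dst"
```

Only the two PDF files end up in the target folder; the HTML file is left behind.
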
##3. Convert all PDFs to plain text##
This script converts all PDF files of the uobphd project to plain text.

    cd /home/corpususer/uob_phd/uob_clean/
    for i in *.pdf
    do pdftotext -nopgbrk -htmlmeta "$i" "/home/corpususer/uob_phd/uob_clean_txt/$i.txt"  # the output path is given as the second argument, because pdftotext does not write to stdout by default
    done

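
For a large collection it may be useful to make the conversion loop restartable and to test it without touching any PDFs. The sketch below is one possible way to do that, not the script used in the project: `txt_path`, `convert_all` and the `PDFTOTEXT` override variable are hypothetical names, and the output naming (`name.pdf.txt`) follows the loop above.

```shell
#!/bin/sh
# Build the output path in the form <out_dir>/<pdf file name>.txt
txt_path() {
    printf '%s/%s.txt\n' "$2" "$(basename "$1")"
}

# Convert every PDF in in_dir, skipping files whose text output already exists.
# Set PDFTOTEXT=echo for a dry run that only prints the arguments.
convert_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        out=$(txt_path "$pdf" "$out_dir")
        [ -s "$out" ] && continue      # a non-empty output file already exists
        ${PDFTOTEXT:-pdftotext} -nopgbrk "$pdf" "$out" || echo "failed: $pdf" >&2
    done
}
```

For example, running `PDFTOTEXT=echo` followed by `convert_all uob_clean uob_clean_txt` prints one line of arguments per PDF instead of converting anything, which is a cheap way to check the paths before a long run.
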
##4. Compare the export results##

    for f in *; do
        echo "$f"
    done > file.txt  # extract all file names

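Remove all file extensions (either command can be used):

1. `sed 's/\..\{3\}$//'` (removes a three-character extension)
2. `sed 's/\(.*\)\..*/\1/'` (removes everything after the last dot)
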
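
The two sed commands behave differently when an extension is not exactly three characters long. The following sketch wraps each one in a function (the names `strip_ext3` and `strip_last_ext` are hypothetical) and runs both on a few sample names.

```shell
#!/bin/sh
# Variant 1: remove a dot followed by exactly three characters at the end of the line.
strip_ext3() { sed 's/\..\{3\}$//'; }

# Variant 2: remove everything from the last dot onward (the greedy .* keeps earlier dots).
strip_last_ext() { sed 's/\(.*\)\..*/\1/'; }

printf '%s\n' thesis.pdf thesis.pdf.txt notes.html | strip_ext3
# prints: thesis, thesis.pdf, notes.html (one per line)

printf '%s\n' thesis.pdf thesis.pdf.txt notes.html | strip_last_ext
# prints: thesis, thesis.pdf, notes (one per line)
```

Both variants turn `thesis.pdf.txt` back into `thesis.pdf`, so the list of converted files can be compared directly with the list of PDF names.
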
Compare:

    grep KEYWORD FILES | sort | uniq

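
An alternative way to compare the results, assuming the goal is to find PDFs that have no converted text file, is to compare the two name lists with `comm`. Note that `comm` is not among the tools used in the original project; the function name `only_in_first` and the list file names are hypothetical.

```shell
#!/bin/sh
# Print the names that appear in the first list but not in the second.
# comm requires sorted input, so both lists are sorted into temporary files first.
only_in_first() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Demo with two small lists (hypothetical names).
work=$(mktemp -d)
printf '%s\n' a.pdf b.pdf c.pdf > "$work/pdf_names.txt"
printf '%s\n' c.pdf a.pdf > "$work/txt_names.txt"
only_in_first "$work/pdf_names.txt" "$work/txt_names.txt"   # prints b.pdf
```

Here the first list holds the PDF names and the second holds the converted file names with the `.txt` extension removed, so any name printed is a PDF that still needs converting.
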
A variant of the conversion script from step 3 without -htmlmeta, so that the output is plain text only:

    cd /home/corpususer/uob_phd/uob_clean/
    for i in *.pdf
    do pdftotext -nopgbrk "$i" "/home/corpususer/uob_phd/uob_clean_txt/$i.txt"
    done