## UoB PhD project scripts.md

# The Script of Data Preparation for the University of Birmingham PhD Dissertation Corpus Project
Sheng Li, University of Birmingham

This file records the scripts used in the UoB PhD corpus project, in the hope that they provide some pointers for future projects.
All of them are basic shell scripts built from standard command-line tools (mostly GNU utilities) plus pdftotext.

Note: to run pdftotext on your own machine, you may need to install xpdf first. On OS X, you can use [homebrew](https://github.com/mxcl/homebrew), [macports](http://www.macports.org/) or [gentoo prefix](http://www.gentoo.org/proj/en/gentoo-alt/prefix/bootstrap-macos.xml) to install it automatically, or compile and install the package manually. On Debian-based platforms, you can simply type

    sudo apt-get install xpdf
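
Before running the later steps, it can help to confirm that the required tools are on the `PATH`. The following is a minimal sketch for a POSIX shell; the helper name `check_tools` is illustrative and not part of the original scripts.

```shell
# check_tools: print "missing: NAME" for every listed command that is not installed.
# Returns 0 when all commands are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools wget find sed pdftotext` reports whichever of the four tools still needs installing.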

The commands used here include:

- wget: the non-interactive network downloader
- find: search for files in a directory hierarchy
- rm: remove files or directories
- xargs: build and execute command lines from standard input
- cp: copy files and directories
- mv: move (rename) files
- cat: concatenate files and print on the standard output
- ls: list directory contents
- echo: display a line of text
- sed: stream editor for filtering and transforming text
- uniq: report or omit repeated lines
- sort: sort lines of text files
- pdftotext: Portable Document Format (PDF) to text converter
- grep: print lines matching a pattern

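## 1. Download all PhD dissertations from the eTheses repository ##

    wget -r -l2 -A pdf http://etheses.bham.ac.uk/view/awards/d=5Fph.html

Here `-r` crawls recursively, `-l2` limits the recursion depth to two levels, and `-A pdf` keeps only files whose names end in `pdf`. The downloaded files are stored in a nested folder hierarchy.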
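
To check what the crawl produced, one can count the PDFs in the downloaded tree. This is a small sketch; `count_pdfs` is a hypothetical helper, and the directory passed to it is whatever folder wget created on your machine.

```shell
# count_pdfs DIR: print the number of regular files under DIR whose names end in .pdf.
count_pdfs() {
    # -type f skips directories; tr removes the padding some wc versions add
    find "$1" -type f -name '*.pdf' | wc -l | tr -d ' '
}
```

Running `count_pdfs SOURCE_FOLDER` before and after the clean-up in the next step gives a quick sanity check on how many files were removed.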

## 2. Extract all files from the subfolder structure ##


Get rid of all files with duplicated names. First count them:

    find SOURCE_FOLDER -depth 2 -name "*ThumbnailVersion" | wc -l

Then delete them:

    find SOURCE_FOLDER -depth 2 -name "*ThumbnailVersion" -exec rm -rf {} \;

The `{}` placeholder is required: without it `rm` receives no path and removes nothing. `-depth 2` is the BSD (OS X) find syntax; with GNU find, use `-mindepth 2 -maxdepth 2` instead. Alternatively, use OS X Spotlight to find all the duplicated-name files and delete them by hand.
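
A variant that lists the matches before deleting them can be sketched as follows. It assumes a find that supports `-mindepth` and `-maxdepth` (GNU find does, and so does the find shipped with OS X); the function names `list_thumbnails` and `remove_thumbnails` are illustrative only.

```shell
# list_thumbnails DIR: print depth-2 entries whose names end in "ThumbnailVersion".
list_thumbnails() {
    find "$1" -mindepth 2 -maxdepth 2 -name '*ThumbnailVersion' -print
}

# remove_thumbnails DIR: delete the same entries. Because of -maxdepth 2,
# find never tries to descend into a directory it has just removed.
remove_thumbnails() {
    find "$1" -mindepth 2 -maxdepth 2 -name '*ThumbnailVersion' -exec rm -rf {} +
}
```

Calling `list_thumbnails SOURCE_FOLDER` first acts as a dry run, so the list can be inspected before `remove_thumbnails SOURCE_FOLDER` deletes anything.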


Extract the remaining PDF files:

    find SOURCE_FOLDER -name "*.pdf" -exec cp {} TARGET_FOLDER \;

`mv` can be used as well, but `cp` is safer because it leaves the downloaded originals in place.
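
Copying every PDF into one flat folder silently overwrites files that share a name. A hedged sketch of a copy that refuses to overwrite is shown below; `flatten_pdfs` is a hypothetical helper, and it assumes the target folder lies outside the source tree.

```shell
# flatten_pdfs SRC DST: copy every PDF under SRC into DST without overwriting.
# A name that already exists in DST is reported on stderr and skipped.
flatten_pdfs() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    find "$src" -type f -name '*.pdf' -exec sh -c '
        dst=$1; shift
        for f in "$@"; do
            target=$dst/$(basename "$f")
            if [ -e "$target" ]; then
                echo "skipped (name clash): $f" >&2
            else
                cp "$f" "$target"
            fi
        done
    ' sh "$dst" {} +
}
```

With `flatten_pdfs SOURCE_FOLDER TARGET_FOLDER 2> clashes.log`, the skipped files end up in `clashes.log` and can be renamed and copied by hand.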


## 3. Convert all PDFs to plain texts ##

This script converts all the PDF files of the uobphd project to plain text.

    cd /home/corpususer/uob_phd/uob_clean/
    for i in *.pdf
    do
        # pdftotext takes the output file as its second argument
        pdftotext -nopgbrk -htmlmeta "$i" "/home/corpususer/uob_phd/uob_clean_txt/$i.txt"
    done
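
For a large corpus it is useful to know which PDFs failed to convert. The loop can be sketched as a function that logs failures; this is an illustration rather than the script used in the project. The names `convert_pdfs`, `PDFTOTEXT` and `failed.log` are assumptions introduced here, and the output path is passed as pdftotext's second argument.

```shell
# convert_pdfs SRC DST: convert each SRC/*.pdf to DST/<name>.pdf.txt.
# PDFTOTEXT may be set to override the converter command (a hook assumed
# here for testing); failed conversions are listed in DST/failed.log.
convert_pdfs() {
    src=$1
    dst=$2
    converter=${PDFTOTEXT:-pdftotext}
    mkdir -p "$dst"
    : > "$dst/failed.log"                  # start with an empty log
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched nothing
        out=$dst/$(basename "$pdf").txt
        if ! "$converter" -nopgbrk -htmlmeta "$pdf" "$out"; then
            echo "$pdf" >> "$dst/failed.log"
        fi
    done
}
```

A call such as `convert_pdfs /home/corpususer/uob_phd/uob_clean /home/corpususer/uob_phd/uob_clean_txt` mirrors the loop above, and an empty `failed.log` afterwards means every file converted.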

## 4. Comparison of the export result ##

    for f in *; do
        echo "$f"
    done > file.txt   # extract all file names

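Remove all file extensions, using either of:

1. `sed 's/\..\{3\}$//' file.txt` strips a final extension of exactly three characters.

2. `sed 's/\(.*\)\..*/\1/' file.txt` strips everything from the last dot onward.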
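
The two expressions differ when an extension is not three characters long. The sketch below wraps each in a function (the names `strip_ext3` and `strip_last_ext` are illustrative) and runs both on two sample names.

```shell
# strip_ext3: drop a final extension of exactly three characters (e.g. .pdf, .txt).
strip_ext3() { sed 's/\..\{3\}$//'; }

# strip_last_ext: drop everything from the last dot onward, whatever its length.
strip_last_ext() { sed 's/\(.*\)\..*/\1/'; }

printf '%s\n' thesis.v2.pdf page.html | strip_ext3
# prints:
#   thesis.v2
#   page.html
printf '%s\n' thesis.v2.pdf page.html | strip_last_ext
# prints:
#   thesis.v2
#   page
```

The first form is the safer choice when every file is known to end in `.pdf` or `.txt`, because it leaves other names untouched.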

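Compare:

    grep KEYWORD FILES | sort | uniq

`grep` takes the pattern first and the files second; `sort | uniq` then prints each matching line once.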
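
One way to find the PDFs that have no converted text is to compare the two sorted name lists with `comm`. This is a sketch under the assumption that the text files are named `NAME.pdf.txt`, as produced by the conversion loop in step 3; `missing_txt` is a hypothetical helper.

```shell
# missing_txt PDF_DIR TXT_DIR: print the name of every PDF in PDF_DIR that has
# no matching NAME.pdf.txt in TXT_DIR.
missing_txt() {
    pdfs=$(mktemp)
    txts=$(mktemp)
    ( cd "$1" && ls ) | grep '\.pdf$' | sort > "$pdfs"
    # Strip the trailing .txt so both lists hold names of the form NAME.pdf.
    ( cd "$2" && ls ) | grep '\.pdf\.txt$' | sed 's/\.txt$//' | sort > "$txts"
    comm -23 "$pdfs" "$txts"      # lines that appear only in the PDF list
    rm -f "$pdfs" "$txts"
}
```

An empty result from `missing_txt uob_clean uob_clean_txt` suggests that every PDF has a text counterpart.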