stevenklise/2012-02-29-text-parsing.md

## 2012-02-29-text-parsing.md

      
Uploading The New Batch of OCR Texts

In this document I use bold text to denote the name
There are 1,188 files in the ThesisTXT_round2 folder, and currently 1,094 Thesis records on itparchive.com. The mismatch comes from file names that the original upload script did not parse properly.
Each Thesis can have multiple Documentations. A Documentation stores the URL of the PDF, some administrative information about the state of the PDF, and a text field titled paper with the contents of the corresponding .txt file. We will assume that all of the text files from the second round of OCR are better than all of the first round. Some Documentations might not match any of the new text files, so if we relied on each new text file overwriting the contents of paper, a few Documentations could be left with their old first-round text. So the first step is to erase the paper field on every Documentation.
Out with the Old

In Terminal, navigate to the Rails project and start up the console (this is an irb session with the code from the Rails project loaded).
(bash)
rails console

Now select all Documentations:
(rails console)
@documentations = Documentation.all

The object @documentations is an Array of Documentation objects. We iterate through every Documentation and update the paper attribute to nil:
(rails console)
@documentations.each do |documentation|
	documentation.update_attributes(:paper => nil)
end

Experiment with the New

What we just did only changed the development database. The production database, which powers itparchive.com, still has all the junky paper fields. Do everything in development first and then, when it works, push it to production.
When it's time to repeat this process on production we will write one script to do it all. For now we will keep working in the console so we can see the output of every command.
Get File Names

Hopefully every .txt file name matches a PDF file name. And hopefully these file names follow a format similar to
LASTNAME.FIRSTNAME.YEAR.thesis.doc.(txt|PDF)

The first three sections, as divided by periods (.), are the important ones: they uniquely identify the file with the person who wrote the paper.
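As a minimal sketch of that naming convention, the helper below splits a file name on periods and keeps the first three parts as an identifying key. The method `thesis_key` and the struct `ThesisKey` are hypothetical names introduced only for this illustration; they are not part of the Rails project.

```ruby
# Hypothetical helper: extracts the LASTNAME.FIRSTNAME.YEAR prefix
# that identifies a thesis file.
ThesisKey = Struct.new(:last_name, :first_name, :year)

def thesis_key(file_name)
  parts = file_name.split(".")
  # A well-formed name has at least LASTNAME, FIRSTNAME and YEAR.
  return nil if parts.length < 3
  ThesisKey.new(parts[0], parts[1], parts[2])
end

key = thesis_key("abeson.kristina.1995.thesis.doc.txt")
puts key.last_name                        # abeson
puts key.year                             # 1995
puts thesis_key("barto000.txt").inspect   # nil
```

The sketch does not validate the year: a name such as matt.berger.txt still has three parts, so its year slot would simply hold "txt".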
To load the list of .txt files into the console we run this command:
(rails console)
text_list = Dir.glob("../ThesisTXT_round2/*.txt")

Working from inside /Archive/itparchive the text files are in the folder /Archive/ThesisTXT_round2/. We are saving the list of filenames to the text_list variable. These are just names of files and not the content yet.
The PDF file names are saved in the media_file_name attribute of each Documentation. These next lines will save every file name into an array named pdfs:
(rails console)
pdfs = []
@documentations.each do |documentation|
	pdfs << documentation.media_file_name
end
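
The same list can be built in one step with `map`. The sketch below uses a plain `Struct` named `FakeDocumentation` as a stand-in for the Documentation model, an assumption made so the snippet runs outside Rails; in the console the call would be made on @documentations instead. The second sample file name is made up.

```ruby
# Stand-in for the ActiveRecord model, for illustration only.
FakeDocumentation = Struct.new(:media_file_name)

def file_names(documentations)
  # map collects the block's return value for every element.
  documentations.map { |documentation| documentation.media_file_name }
end

docs = [
  FakeDocumentation.new("berry.sarah.2000.thesis.doc.pdf"),
  FakeDocumentation.new("abeson.kristina.1995.thesis.doc.pdf")
]
puts file_names(docs)
```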

Compare file names

Now it's time to compare the text names to the pdf names. Let's look at the first element of both text_list and pdfs.
(rails console)
text_list.first
=> "../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt"
pdfs.first
=> "berry.sarah.2000.thesis.doc.pdf"

Notice in text_list we have some directory information at the front of the string. We don't need that and in fact it makes it harder to compare with pdfs. Let's get rid of it.
(rails console)
text_names = []
text_list.each do |text|
	text_array = text.split("/")
	text_names << text_array[2]
end
text_names.first
=> "abeson.kristina.1995.thesis.doc.txt"
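
Ruby's standard `File.basename` gives the same result without assuming that exactly two path components come before the file name, which is what text_array[2] relies on. A small sketch; the method name `strip_directories` is made up for this example.

```ruby
def strip_directories(paths)
  # File.basename returns the last component of a path.
  paths.map { |path| File.basename(path) }
end

puts strip_directories(["../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt"])
# abeson.kristina.1995.thesis.doc.txt
```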

Much nicer. Now split a text_names element and a pdfs element at the periods: if the first three parts are equal, we have a match.
(rails console)
text = text_names.first.split(".")
=> ["abeson", "kristina", "1995", "thesis", "doc", "txt"]
pdf = pdfs.first.split(".")
=> ["berry", "sarah", "2000", "thesis", "doc", "pdf"]

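See? Let's test it out on the whole list. This first pass compares only the first two parts, the last name and the first name.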
(rails console)
matched = 0
unmatched = 0
text_names.each do |text|
	find_match = false
	pdfs.each do |pdf|
		split_pdf = pdf.split(".")
		split_text = text.split(".")
		if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1]
			find_match = true
		end
	end
	if find_match
		matched += 1
	else
		unmatched += 1
	end
end
matched
=> 1061
unmatched
=> 127

And 1061 + 127 = 1188, which is great because that is exactly how many text files we have. The number of matches (1061) is also less than the total number of PDFs (1093).
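The nested loop compares every text name with every PDF name. As a sketch of an alternative, the PDF keys can be collected into a `Set` first so that each text name needs only one lookup. `name_key` and `split_by_match` are hypothetical names, and the PDF names in the usage lines are illustrative.

```ruby
require "set"

# Key used for matching: last name and first name only.
def name_key(file_name)
  file_name.split(".").first(2)
end

# Returns [matched, unmatched] arrays of text file names.
def split_by_match(text_names, pdf_names)
  pdf_keys = Set.new(pdf_names.map { |pdf| name_key(pdf) })
  text_names.partition { |text| pdf_keys.include?(name_key(text)) }
end

matched, unmatched = split_by_match(
  ["abeson.kristina.1995.thesis.doc.txt", "barto000.txt"],
  ["abeson.kristina.1995.thesis.doc.pdf", "berry.sarah.2000.thesis.doc.pdf"]
)
puts matched.length    # 1
puts unmatched.length  # 1
```

`partition` returns both lists at once, so separate matched and unmatched counters are not needed; the lengths of the two arrays give the same numbers.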
In with the New

We're looking pretty good. But first let's see which text and PDF files aren't matching up, and reassure ourselves that there aren't duplicates.
(rails console)
pdfs.length
=> 1093
pdfs.uniq.length
=> 1093

So we don't have any duplicate PDF names. Good.
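Note that uniq only rules out PDF names that are identical character for character. Because the matching looks at just the leading parts of each name, two different PDF names can still share a key. A sketch for listing such collisions, using a hypothetical helper `colliding_keys` and made-up sample names:

```ruby
# Groups file names by their first `fields` period-separated parts and
# keeps only the groups that contain more than one name.
def colliding_keys(file_names, fields = 2)
  groups = file_names.group_by { |name| name.split(".").first(fields) }
  groups.select { |_key, names| names.length > 1 }
end

sample = [
  "smith.jane.1999.thesis.doc.pdf",
  "smith.jane.2001.thesis.doc.pdf",
  "berry.sarah.2000.thesis.doc.pdf"
]
colliding_keys(sample).each do |key, names|
  puts "#{key.join('.')}: #{names.join(', ')}"
end
```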
(rails console)
matched = 0
unmatched = 0
matched_text_files = []
unmatched_text_files = []
matched_pdfs = []
text_names.each do |text|
	find_match = false
	pdfs.each do |pdf|
		split_pdf = pdf.split(".")
		split_text = text.split(".")
		if split_text.length >= 3
			if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1] && split_pdf[2] == split_text[2]
				matched_pdfs << pdf
				find_match = true
			end
		end
	end
	if find_match
		matched += 1
		matched_text_files << text
	else
		unmatched += 1
		unmatched_text_files << text
	end
end
matched
=> 1058
unmatched
=> 130
matched + unmatched
=> 1188

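Ok. With the year check in place only 1058 files match. Dropping the year comparison (split_text[2]), as in the first test, brings the count back to 1061 and saves 3 more .txt files from being orphaned.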
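To see how a year check can cost matches, the comparison can be parameterised by how many leading parts must agree. The sketch below uses a hypothetical method `count_matches` and invented file names in which one PDF carries the placeholder "year" instead of a real year. That is one possible cause, not a confirmed explanation for the three files above.

```ruby
# Counts the text names whose first `fields` parts equal those of some PDF name.
def count_matches(text_names, pdf_names, fields)
  pdf_keys = pdf_names.map { |pdf| pdf.split(".").first(fields) }
  text_names.count { |text| pdf_keys.include?(text.split(".").first(fields)) }
end

texts = ["doe.alex.2003.thesis.doc.txt", "roe.sam.1998.thesis.doc.txt"]
pdfs  = ["doe.alex.year.thesis.doc.pdf", "roe.sam.1998.thesis.doc.pdf"]

puts count_matches(texts, pdfs, 3)  # 1
puts count_matches(texts, pdfs, 2)  # 2
```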
The final script I ran is in /lib/tasks/upload_ocr.rake.
It outputs 3 files: one listing the .txt files that were matched, one listing the .txt files that were unmatched, and one listing the ids of the Documentations that were not matched.
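Each of these reports is a timestamp followed by one entry per line. A minimal sketch of writing one, using the block form of `File.open` so the file is closed even if an error occurs. `write_report` is a hypothetical helper, and the temporary directory stands in for the project's log folder.

```ruby
require "tmpdir"

# Writes a timestamp line followed by one entry per line.
def write_report(path, entries)
  File.open(path, "w") do |file|
    file.puts Time.now.to_s
    entries.each { |entry| file.puts entry }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "unmatched_text_files.txt")
  write_report(path, ["barto000.txt", "avDOC.txt"])
  puts File.read(path)
end
```

Because `puts` converts each entry to a string, the same helper works for the list of Documentation ids.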
There are 34 Documentations in the database without text files. If each of these has a text file that simply didn't match, then there are 103 text files that don't correspond to anything in the database. These could be duplicates, or Thesis projects that were missed in the initial upload.

  
## pdfs_not_on_site.txt
20110314094754600.pdf
avDOC.PDF
barto000.PDF
barto001.PDF
berko000.PDF
blaze000.PDF
bobro000.PDF
bonaf000.PDF
borbe000.PDF
bose.000.PDF
bruu.m.lois.1989.thesis.PDF
bruu.m.lois.1989.thesis_1.PDF
buchanan.m.john.1989.thesis_1.PDF
buchanan.m.john.1989.thesis_2.PDF
burke.l.ann.1984.thesis.PDF
calla000.PDF
campbell..b.robert.1998.thesis_1.PDF
campbell..b.robert.1998.thesis_2.PDF
canar000.PDF
canario.a.l.1994.thesis.PDF
carty.a.david.1997.thesis.PDF
cashin.c.ronald.1991.PDF
chang.mich.goodman.eliz.2003.thesis.doc.PDF
chang.yufang.unknown.thesis.doc.PDF
chann000.PDF
cho.lisa.unknown.thesis.doc.PDF
choe.000.PDF
chow.ho-ming.sheldon.2000.thesis.PDF
chullasapya.suchard.unknown.thesis.doc.PDF
chung000.PDF
colichio.torry.year.thesis.doc.pdf
couta000.PDF
cowen000.PDF
d'aspice.deanna.thesis.PDF
degeo000.PDF
delacruz.rene.juan.2000.thesis.doc.PDF
depietro.peter.year.thesis.doc.PDF
DOC.PDF
DOC000.PDF
dubois.r.luke.2006.thesis.doc.PDF
dunlap.charles.unknown.thesis.doc.PDF
dupree.nicole.year.thesis.doc.PDF
eid.erik.thesis.PDF
engel.adam.thesis.PDF
every.van.shawn.2004.thesis.PDF
ezeta.ivo.thesis.doc.PDF
fallon.william.thesis.doc.PDF
farley.linda.thesis.doc.PDF
feid.000.PDF
gisel000.PDF
giselle.leal.year.thesis.doc.PDF
giummo.joan.unknown.thesis.doc.PDF
goff.000.PDF
gonzalez.christopher2001.thesis.doc.PDF
guerr000.PDF
halpe000.PDF
hamazaki.kumi.unknown.thesis.doc.PDF
harrell.jr.d.fox.2000.PDF
hong.000.PDF
inoue.hidetaka.thesis.doc.PDF
jin.wook.sungpark.moon.2002.thesis.doc.PDF
joness.r.john.1992.thesis.PDF
jungman.s.erika.2000.thesis.PDF
kanatani.tomohiko.thesis.1993.PDF
kapla000.PDF
keller.carol.year.thesis.doc.pdf
kemelson.joel.adam.1991.thesis.PDF
kirov000.PDF
langer.miriam.year.thesis.doc.pdf
last.first.year.thesis.type.pdf
lawso000.PDF
lim.t000.PDF
lu.jenjui.unknown.thesis.doc.PDF
lupo.jonathan.year.thesis.doc.pdf
mangat.chetan.thesis.doc.PDF
maras000.PDF
marko000.PDF
markowitz.c.gary.1994.thesis.PDF
marsh000.PDF
matt.berger.PDF
mcfad000.PDF
mcgar000.PDF
mcken000.PDF
meehan.j.patrick.1994.thesis.PDF
megal000.PDF
meije000.PDF
mizoh000.PDF
morley.d.beau.1999.thesis.doc.PDF
morri000.PDF
murray.m.christina.1994.thesis.PDF
ortiz000.PDF
palma000.PDF
pollex.alessandro.year.thesis.doc.pdf
pratt000.PDF
psomi000.PDF
raffel.daniel.year.thesis.doc.PDF
ramos000.PDF
reyes000.PDF
roscoe.seidenberg.sheynkman.thesis.doc.2.PDF
roscoe.seidenberg.sheynkman.thesis.doc.PDF
rosen.zachary.unknown.thesis.doc.PDF
ruitg000.PDF
ruitg001.PDF
russe000.PDF
russe001.PDF
schul000.PDF
schwa000.PDF
sealy000.PDF
seitz.jr.robert.1988.thesis.doc.PDF
sloan.tamara.thesis.doc.PDF
smole000.PDF
smole001.PDF
spraf000.PDF
spraf001.PDF
stack000.PDF
stack001.PDF
stack002.PDF
suiss000.PDF
sun.dachen.unknown.thesis.doc.PDF
sungp000.PDF
talisman.linda.thesis.PDF
tan.xiaoli..2005.thesis.doc.PDF
taylo000.PDF
thomp000.PDF
varse000.PDF
vasil000.PDF
washington.theresa.thesis.doc.PDF
weinstein.anat.unknown.thesis.doc.PDF
wolf.ahmi.ophra.mogensen.2004.thesis.doc.PDF
wu.hsinyi.unknown.thesis.doc.PDF
xu.xu000.PDF
yen.h000.PDF
yeung.marianne.miller.do.2000.thesis.doc.PDF
yokoy000.PDF

## Unmatched-txt-files-2012-02-29-145056-0500.txt
amcbrochure.DOC.txt
avDOC.txt
barto000.txt
barto001.txt
berko000.txt
blaze000.txt
bobro000.txt
bonaf000.txt
borbe000.txt
bose.000.txt
bruu.m.lois.1989.thesis.txt
bruu.m.lois.1989.thesis_1.txt
buchanan.m.john.1989.thesis_1.txt
buchanan.m.john.1989.thesis_2.txt
burke.l.ann.1984.thesis.txt
calla000.txt
campbell..b.robert.1998.thesis_1.txt
campbell..b.robert.1998.thesis_2.txt
canar000.txt
canario.a.l.1994.thesis.txt
carty.a.david.1997.thesis.txt
cashin.c.ronald.1991.txt
chang.mich.goodman.eliz.2003.thesis.doc.txt
chang.yufang.unknown.thesis.doc.txt
chann000.txt
cho.lisa.unknown.thesis.doc.txt
choe.000.txt
chow.ho-ming.sheldon.2000.thesis.txt
chullasapya.suchard.unknown.thesis.doc.txt
chung000.txt
couta000.txt
cowen000.txt
d'aspice.deanna.thesis.txt
degeo000.txt
delacruz.rene.juan.2000.thesis.doc.txt
depietro.peter.year.thesis.doc.txt
DOC.txt
DOC000.txt
dubois.r.luke.2006.thesis.doc.txt
dunlap.charles.unknown.thesis.doc.txt
dupree.nicole.year.thesis.doc.txt
eid.erik.thesis.txt
engel.adam.thesis.txt
every.van.shawn.2004.thesis.txt
fallon.william.thesis.doc.txt
farley.linda.thesis.doc.txt
feid.000.txt
gisel000.txt
giselle.leal.year.thesis.doc.txt
giummo.joan.unknown.thesis.doc.txt
goff.000.txt
gonzalez.christopher2001.thesis.doc.txt
guerr000.txt
halpe000.txt
harrell.jr.d.fox.2000.txt
hong.000.txt
inoue.hidetaka.thesis.doc.txt
jin.wook.sungpark.moon.2002.thesis.doc.txt
joness.r.john.1992.thesis.txt
jungman.s.erika.2000.thesis.txt
kanatani.tomohiko.thesis.1993.txt
kapla000.txt
kemelson.joel.adam.1991.thesis.txt
kirov000.txt
lawso000.txt
lim.t000.txt
lu.jenjui.unknown.thesis.doc.txt
mangat.chetan.thesis.doc.txt
maras000.txt
marko000.txt
markowitz.c.gary.1994.thesis.txt
marsh000.txt
matt.berger.txt
mcfad000.txt
mcgar000.txt
mcken000.txt
meehan.j.patrick.1994.thesis.txt
megal000.txt
meije000.txt
mizoh000.txt
morley.d.beau.1999.thesis.doc.txt
morri000.txt
murray.m.christina.1994.thesis.txt
ortiz000.txt
palma000.txt
pratt000.txt
psomi000.txt
ramos000.txt
reyes000.txt
roscoe.seidenberg.sheynkman.thesis.doc.2.txt
roscoe.seidenberg.sheynkman.thesis.doc.txt
rosen.zachary.unknown.thesis.doc.txt
ruitg000.txt
ruitg001.txt
russe000.txt
russe001.txt
schul000.txt
schwa000.txt
sealy000.txt
seitz.jr.robert.1988.thesis.doc.txt
sloan.tamara.thesis.doc.txt
smole000.txt
smole001.txt
spieg000.txt
spiegel.steven.1983.thesis.doc.txt
spraf000.txt
spraf001.txt
stack000.txt
stack001.txt
stack002.txt
suiss000.txt
sun.dachen.unknown.thesis.doc.txt
sungp000.txt
talisman.linda.thesis.txt
tan.xiaoli..2005.thesis.doc.txt
taylo000.txt
thomp000.txt
varse000.txt
vasil000.txt
washington.theresa.thesis.doc.txt
weinstein.anat.unknown.thesis.doc.txt
wolf.ahmi.ophra.mogensen.2004.thesis.doc.txt
wu.hsinyi.unknown.thesis.doc.txt
xu.xu000.txt
yen.h000.txt
yeung.marianne.miller.do.2000.thesis.doc.txt
yokoy000.txt

## upload_ocr.rake
namespace :ocr do
  desc "Remove all documentation.paper values"
  task :remove_papers => :environment do
    @documentations = Documentation.all
    @documentations.each do |documentation|
      documentation.update_attributes(:paper => nil)
    end
  end

  desc "Upload new OCR documents, output to terminal mismatches"
  task :save_texts => :environment do

    @documentations = Documentation.all
    text_list = Dir.glob("../ThesisTXT_round2/*.txt")

    text_names = []
    text_list.each do |text|
      text_array = text.split("/")
      text_names << text_array[2]
    end

    matched = 0
    unmatched = 0
    matched_text_files = []
    unmatched_text_files = []
    matched_pdfs = []
    text_names.each do |text|
      find_match = false
      match = nil
      split_text = text.split(".")
      @documentations.each do |documentation|
        pdf = documentation.media_file_name
        split_pdf = pdf.split(".")
        # Match on last name and first name only; the year check is disabled.
        if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1] #&& split_pdf[2] == split_text[2]
          matched_pdfs << documentation.id
          find_match = true
          # If several Documentations match, the last one found receives the text.
          match = documentation
        end
      end
      if find_match
        matched += 1
        matched_text_files << text
        # File.read returns the whole file as a single String and closes it;
        # readlines followed by join("\n") would have doubled every newline.
        paper = File.read("../ThesisTXT_round2/#{text}")
        match.update_attributes paper: paper
      else
        unmatched += 1
        unmatched_text_files << text
      end
    end

    puts "MATCHED = #{matched}"
    if f = File.new("#{Rails.root}/log/matched_text_files.txt", "w")
      f.write(Time.now.to_s+"\n")
      matched_text_files.each do |e|
        f.write(e+"\n")
      end
      f.close
    end
    puts ""
    puts "UNMATCHED = #{unmatched}"
    # puts unmatched_text_files
    if f = File.new("#{Rails.root}/log/unmatched_text_files.txt", "w")
      f.write(Time.now.to_s+"\n")
      unmatched_text_files.each do |e|
        f.write(e+"\n")
      end
      f.close
    end
    puts ""

    unmatched_documentations = []
    @documentations.each do |documentation|
      if ! matched_pdfs.include? documentation.id
        unmatched_documentations << documentation.id
      end
    end

    if f = File.new("#{Rails.root}/log/unmatched_documentation_ids.txt", "w")
      f.write(Time.now.to_s + "\n")
      unmatched_documentations.each do |e|
        f.write("#{e}\n")
      end
      f.close
    end

  end
end