@stevenklise
Created February 29, 2012 19:40

Uploading The New Batch of OCR Texts

In this document I use bold text to denote the name

There are 1,188 files in the ThesisTXT_round2 folder, and currently 1,094 Thesis records on itparchive.com. This mismatch comes from file names that the original upload script did not parse properly.

Each Thesis can have multiple Documentations. A Documentation stores the URL of the PDF, some administrative information about the state of the PDF, and a text field titled paper with the contents of the corresponding .txt file. We will assume that every text file from the second round of OCR is better than its first-round counterpart. Some Documentations might not match any of the new text files, so if we relied only on new text replacing the contents of paper, a few Documentations could be left with their old paper field. So the first step is to erase the paper field on every Documentation.

Out with the Old

In Terminal, navigate to the Rails project and start up the console (this is an irb session with the code from the Rails project loaded).

(bash)

rails console

Now select all Documentations:

(rails console)

@documentations = Documentation.all

The object @documentations is an Array of Documentation objects. We iterate through every Documentation and update the paper attribute to nil:

(rails console)

@documentations.each do |documentation|
	documentation.update_attributes(:paper => nil)
end

Experiment with the New

What we just did only changed the development database. The production database, which powers itparchive.com, still has all the junky paper fields. Do everything in development first and then, when it works, push it to production.

When it's time to do this process on production we will write one script to do it all. For now we will continue working in the console to see the output of every command.

Get File Names

Hopefully every .txt file name matches a PDF file name. And hopefully these file names follow a format similar to

LASTNAME.FIRSTNAME.YEAR.thesis.doc.(txt|PDF)

The first three sections, as divided by periods (.), are the important ones: they uniquely identify the file with the person who wrote the paper.
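
As a minimal sketch of how those three sections could be pulled out of a name: the helper parse_thesis_name and the four-digit-year check are assumptions made for illustration, not part of the original script.

```ruby
# Hypothetical helper: pull the identifying parts out of a thesis file name.
# Assumes the LASTNAME.FIRSTNAME.YEAR.thesis.doc.(txt|PDF) convention.
def parse_thesis_name(file_name)
  last, first, year = file_name.split(".")
  # A four-digit year is assumed; anything else (e.g. "unknown") becomes nil.
  year = nil unless year.to_s =~ /\A\d{4}\z/
  { :last => last, :first => first, :year => year }
end

p parse_thesis_name("abeson.kristina.1995.thesis.doc.txt")
p parse_thesis_name("cho.lisa.unknown.thesis.doc.PDF")
```

Names that do not follow the convention still parse, but their parts may not mean what the keys suggest, so the result is only as good as the file name.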

To load the list of .txt files into the console we run this command:

(rails console)

text_list = Dir.glob("../ThesisTXT_round2/*.txt")

Working from inside /Archive/itparchive the text files are in the folder /Archive/ThesisTXT_round2/. We are saving the list of filenames to the text_list variable. These are just names of files and not the content yet.
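
The listing step can be tried outside the project with a throwaway folder. This is only a sketch: list_text_files is a hypothetical helper, and the sort is added so the order is predictable regardless of how the file system returns entries.

```ruby
require "tmpdir"

# Hypothetical helper: list the .txt files in a folder, sorted by path.
def list_text_files(folder)
  Dir.glob(File.join(folder, "*.txt")).sort
end

# Demonstration against a temporary folder rather than the real archive.
Dir.mktmpdir do |dir|
  ["b.txt", "a.txt", "notes.pdf"].each do |name|
    File.write(File.join(dir, name), "")
  end
  # Only the two .txt files are listed; notes.pdf is ignored by the pattern.
  puts list_text_files(dir).map { |path| File.basename(path) }
end
```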

The PDF file names are saved in the media_file_name attribute of each Documentation. These next lines will save every file name into an array named pdfs:

(rails console)

pdfs = []
@documentations.each do |documentation|
	pdfs << documentation.media_file_name
end
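
The same collection step can be written with map. The sketch below uses a Struct as a stand-in for the Rails model so it runs outside the console, and it assumes (without evidence from the database) that some records might have no file name, which compact would drop.

```ruby
# Stand-in for the Rails model so the sketch runs outside the console.
Documentation = Struct.new(:media_file_name)

documentations = [
  Documentation.new("berry.sarah.2000.thesis.doc.pdf"),
  Documentation.new(nil) # a record with no attached PDF (assumed possible)
]

# map collects the attribute; compact drops records without a file name.
pdfs = documentations.map(&:media_file_name).compact
puts pdfs
```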

Compare file names

Now it's time to compare the text names to the PDF names. Let's look at the first element of both text_list and pdfs.

(rails console)

text_list.first
=> "../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt"
pdfs.first
=> "berry.sarah.2000.thesis.doc.pdf"

Notice in text_list we have some directory information at the front of the string. We don't need that and in fact it makes it harder to compare with pdfs. Let's get rid of it.

(rails console)

text_names = []
text_list.each do |text|
	text_array = text.split("/")
	text_names << text_array[2]
end
text_names.first
=> "abeson.kristina.1995.thesis.doc.txt"
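
Taking element 2 of the split only works while the path has exactly two directory levels. A sketch of an alternative using File.basename, which returns the last path component however deep the path is (the second sample path is made up for illustration):

```ruby
text_list = [
  "../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt",
  # Made-up absolute path, to show that depth does not matter.
  "/Archive/ThesisTXT_round2/berry.sarah.2000.thesis.doc.txt"
]

# File.basename strips every leading directory from each path.
text_names = text_list.map { |path| File.basename(path) }
puts text_names
```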

Much nicer. Now if we split a text_names element and a pdfs element at the periods and the first three parts (last name, first name, year) agree, we have a match.

(rails console)

text = text_names.first.split(".")
=> ["abeson", "kristina", "1995", "thesis", "doc", "txt"]
pdf = pdfs.first.split(".")
=> ["berry", "sarah", "2000", "thesis", "doc", "pdf"]

See? Let's test it out on the full lists, comparing only the last name and first name to start.

(rails console)

matched = 0
unmatched = 0
text_names.each do |text|
	find_match = false
	pdfs.each do |pdf|
		split_pdf = pdf.split(".")
		split_text = text.split(".")
		if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1]
			find_match = true
		end
	end
	if find_match
		matched += 1
	else
		unmatched += 1
	end
end
matched
=> 1061
unmatched
=> 127

And 1061 + 127 = 1188, which is great because that is exactly how many text files we have. The number matched is also less than the total number of PDFs (1093).
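
The nested loops compare every text name against every PDF name. A sketch of the same comparison done with a single lookup per text file is below; name_key is a hypothetical helper and the file names are sample data, not the real lists.

```ruby
require "set"

# Hypothetical helper: the leading period-separated pieces of a file name.
def name_key(file_name, parts = 2)
  file_name.split(".").first(parts)
end

# Sample data only; the real lists come from Dir.glob and the database.
pdfs = ["berry.sarah.2000.thesis.doc.pdf", "abeson.kristina.1995.thesis.doc.pdf"]
text_names = ["abeson.kristina.1995.thesis.doc.txt", "nobody.here.1999.thesis.doc.txt"]

# Build the set of PDF keys once, then test each text name against it.
pdf_keys = pdfs.map { |pdf| name_key(pdf) }.to_set
matched, unmatched = text_names.partition do |text|
  pdf_keys.include?(name_key(text))
end

puts "matched = #{matched.length}, unmatched = #{unmatched.length}"
```

With about a thousand names on each side the nested loops are fast enough, so this is a matter of readability more than speed.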

In with the New

We're looking pretty good. But first let's see which text and PDF files aren't matching up, and reassure ourselves that there aren't duplicates.

(rails console)

pdfs.length
=> 1093
pdfs.uniq.length
=> 1093

So we don't have any duplicate PDF names. Good.
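
Note that uniq only catches names that are identical in full. Because the comparison looks at just the leading pieces, two different PDF names can still share the same key. A sketch of a check for that, using three names from the file lists attached to this document:

```ruby
pdfs = [
  "bruu.m.lois.1989.thesis.PDF",
  "bruu.m.lois.1989.thesis_1.PDF",
  "burke.l.ann.1984.thesis.PDF"
]

# Group by the first two pieces and keep only groups with more than one name.
shared = pdfs.group_by { |pdf| pdf.split(".").first(2) }
             .select { |_key, names| names.length > 1 }

shared.each do |key, names|
  puts "#{key.join('.')}: #{names.join(', ')}"
end
```

When several PDFs share a key, a loop that remembers only one match per text file will update just one of them, so a check like this shows where manual review may be needed.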

(rails console)

matched = 0
unmatched = 0
matched_text_files = []
unmatched_text_files = []
matched_pdfs = []
text_names.each do |text|
	find_match = false
	pdfs.each do |pdf|
		split_pdf = pdf.split(".")
		split_text = text.split(".")
		if split_text.length >= 3
			if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1] && split_pdf[2] == split_text[2]
				matched_pdfs << pdf
				find_match = true
			end
		end
	end
	if find_match
		matched += 1
		matched_text_files << text
	else
		unmatched += 1
		unmatched_text_files << text
	end
end
matched
=> 1058
unmatched
=> 130
matched + unmatched
=> 1188

Ok. With the year check (split_text[2]) in place only 1058 files match. Getting rid of it, as in the first test, brings the count back to 1061 and saves 3 more .txt files from orphanage.
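
To see which files the looser comparison rescues, the two results can be subtracted. This is a sketch with made-up names; any_pdf_match? is a hypothetical helper.

```ruby
# Hypothetical helper: does any PDF share the first `parts` pieces with `text`?
def any_pdf_match?(text, pdfs, parts)
  key = text.split(".").first(parts)
  pdfs.any? { |pdf| pdf.split(".").first(parts) == key }
end

# Made-up sample names, for illustration only.
pdfs = ["doe.jane.2001.thesis.doc.pdf", "roe.sam.1999.thesis.doc.pdf"]
text_names = ["doe.jane.unknown.thesis.doc.txt", "roe.sam.1999.thesis.doc.txt"]

strict = text_names.select { |text| any_pdf_match?(text, pdfs, 3) } # with year
loose  = text_names.select { |text| any_pdf_match?(text, pdfs, 2) } # without year

# Files matched only when the year is ignored.
puts "rescued by dropping the year: #{(loose - strict).inspect}"
```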

The final script I ran is in /lib/tasks/upload_ocr.rake.

It outputs 3 files: one listing all the .txt files that were matched, one listing the .txt files that were unmatched, and one listing the Documentation ids that were not matched.
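
Each of those files is a timestamp followed by one entry per line. A sketch of that output step on its own, written to a temporary folder instead of the Rails log directory (write_list is a hypothetical helper):

```ruby
require "tmpdir"

# Hypothetical helper: write a timestamp line, then one entry per line.
def write_list(path, entries)
  # The block form of File.open closes the file even if a write fails.
  File.open(path, "w") do |f|
    f.puts Time.now.to_s
    entries.each { |entry| f.puts entry }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "unmatched_text_files.txt")
  write_list(path, ["avDOC.txt", "barto000.txt"])
  puts File.read(path)
end
```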

There are 34 Documentations in the database without text files. If each of these has a text file that simply didn't match, then there are 103 text files that don't correspond to anything in the database. These could be duplicates, or Thesis projects that were missed in the initial upload.
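
One way to look for text files that simply didn't match is to compare names with the final extension removed and case ignored. The sketch below assumes that two names which agree on everything before the last extension belong together; stem is a hypothetical helper, and the sample names come from the file lists attached to this document.

```ruby
# Hypothetical helper: everything before the final extension, lowercased.
def stem(file_name)
  File.basename(file_name, ".*").downcase
end

# A few sample names from the attached file lists.
pdf_names  = ["20110314094754600.pdf", "avDOC.PDF", "barto000.PDF"]
text_files = ["amcbrochure.DOC.txt", "avDOC.txt", "barto000.txt"]

pdf_stems = pdf_names.map { |name| stem(name) }
paired, orphaned = text_files.partition { |name| pdf_stems.include?(stem(name)) }

puts "paired: #{paired.inspect}"
puts "orphaned: #{orphaned.inspect}"
```

Text files that end up in orphaned under this looser rule are the better candidates for duplicates or missed Thesis projects.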

20110314094754600.pdf
avDOC.PDF
barto000.PDF
barto001.PDF
berko000.PDF
blaze000.PDF
bobro000.PDF
bonaf000.PDF
borbe000.PDF
bose.000.PDF
bruu.m.lois.1989.thesis.PDF
bruu.m.lois.1989.thesis_1.PDF
buchanan.m.john.1989.thesis_1.PDF
buchanan.m.john.1989.thesis_2.PDF
burke.l.ann.1984.thesis.PDF
calla000.PDF
campbell..b.robert.1998.thesis_1.PDF
campbell..b.robert.1998.thesis_2.PDF
canar000.PDF
canario.a.l.1994.thesis.PDF
carty.a.david.1997.thesis.PDF
cashin.c.ronald.1991.PDF
chang.mich.goodman.eliz.2003.thesis.doc.PDF
chang.yufang.unknown.thesis.doc.PDF
chann000.PDF
cho.lisa.unknown.thesis.doc.PDF
choe.000.PDF
chow.ho-ming.sheldon.2000.thesis.PDF
chullasapya.suchard.unknown.thesis.doc.PDF
chung000.PDF
colichio.torry.year.thesis.doc.pdf
couta000.PDF
cowen000.PDF
d'aspice.deanna.thesis.PDF
degeo000.PDF
delacruz.rene.juan.2000.thesis.doc.PDF
depietro.peter.year.thesis.doc.PDF
DOC.PDF
DOC000.PDF
dubois.r.luke.2006.thesis.doc.PDF
dunlap.charles.unknown.thesis.doc.PDF
dupree.nicole.year.thesis.doc.PDF
eid.erik.thesis.PDF
engel.adam.thesis.PDF
every.van.shawn.2004.thesis.PDF
ezeta.ivo.thesis.doc.PDF
fallon.william.thesis.doc.PDF
farley.linda.thesis.doc.PDF
feid.000.PDF
gisel000.PDF
giselle.leal.year.thesis.doc.PDF
giummo.joan.unknown.thesis.doc.PDF
goff.000.PDF
gonzalez.christopher2001.thesis.doc.PDF
guerr000.PDF
halpe000.PDF
hamazaki.kumi.unknown.thesis.doc.PDF
harrell.jr.d.fox.2000.PDF
hong.000.PDF
inoue.hidetaka.thesis.doc.PDF
jin.wook.sungpark.moon.2002.thesis.doc.PDF
joness.r.john.1992.thesis.PDF
jungman.s.erika.2000.thesis.PDF
kanatani.tomohiko.thesis.1993.PDF
kapla000.PDF
keller.carol.year.thesis.doc.pdf
kemelson.joel.adam.1991.thesis.PDF
kirov000.PDF
langer.miriam.year.thesis.doc.pdf
last.first.year.thesis.type.pdf
lawso000.PDF
lim.t000.PDF
lu.jenjui.unknown.thesis.doc.PDF
lupo.jonathan.year.thesis.doc.pdf
mangat.chetan.thesis.doc.PDF
maras000.PDF
marko000.PDF
markowitz.c.gary.1994.thesis.PDF
marsh000.PDF
matt.berger.PDF
mcfad000.PDF
mcgar000.PDF
mcken000.PDF
meehan.j.patrick.1994.thesis.PDF
megal000.PDF
meije000.PDF
mizoh000.PDF
morley.d.beau.1999.thesis.doc.PDF
morri000.PDF
murray.m.christina.1994.thesis.PDF
ortiz000.PDF
palma000.PDF
pollex.alessandro.year.thesis.doc.pdf
pratt000.PDF
psomi000.PDF
raffel.daniel.year.thesis.doc.PDF
ramos000.PDF
reyes000.PDF
roscoe.seidenberg.sheynkman.thesis.doc.2.PDF
roscoe.seidenberg.sheynkman.thesis.doc.PDF
rosen.zachary.unknown.thesis.doc.PDF
ruitg000.PDF
ruitg001.PDF
russe000.PDF
russe001.PDF
schul000.PDF
schwa000.PDF
sealy000.PDF
seitz.jr.robert.1988.thesis.doc.PDF
sloan.tamara.thesis.doc.PDF
smole000.PDF
smole001.PDF
spraf000.PDF
spraf001.PDF
stack000.PDF
stack001.PDF
stack002.PDF
suiss000.PDF
sun.dachen.unknown.thesis.doc.PDF
sungp000.PDF
talisman.linda.thesis.PDF
tan.xiaoli..2005.thesis.doc.PDF
taylo000.PDF
thomp000.PDF
varse000.PDF
vasil000.PDF
washington.theresa.thesis.doc.PDF
weinstein.anat.unknown.thesis.doc.PDF
wolf.ahmi.ophra.mogensen.2004.thesis.doc.PDF
wu.hsinyi.unknown.thesis.doc.PDF
xu.xu000.PDF
yen.h000.PDF
yeung.marianne.miller.do.2000.thesis.doc.PDF
yokoy000.PDF
amcbrochure.DOC.txt
avDOC.txt
barto000.txt
barto001.txt
berko000.txt
blaze000.txt
bobro000.txt
bonaf000.txt
borbe000.txt
bose.000.txt
bruu.m.lois.1989.thesis.txt
bruu.m.lois.1989.thesis_1.txt
buchanan.m.john.1989.thesis_1.txt
buchanan.m.john.1989.thesis_2.txt
burke.l.ann.1984.thesis.txt
calla000.txt
campbell..b.robert.1998.thesis_1.txt
campbell..b.robert.1998.thesis_2.txt
canar000.txt
canario.a.l.1994.thesis.txt
carty.a.david.1997.thesis.txt
cashin.c.ronald.1991.txt
chang.mich.goodman.eliz.2003.thesis.doc.txt
chang.yufang.unknown.thesis.doc.txt
chann000.txt
cho.lisa.unknown.thesis.doc.txt
choe.000.txt
chow.ho-ming.sheldon.2000.thesis.txt
chullasapya.suchard.unknown.thesis.doc.txt
chung000.txt
couta000.txt
cowen000.txt
d'aspice.deanna.thesis.txt
degeo000.txt
delacruz.rene.juan.2000.thesis.doc.txt
depietro.peter.year.thesis.doc.txt
DOC.txt
DOC000.txt
dubois.r.luke.2006.thesis.doc.txt
dunlap.charles.unknown.thesis.doc.txt
dupree.nicole.year.thesis.doc.txt
eid.erik.thesis.txt
engel.adam.thesis.txt
every.van.shawn.2004.thesis.txt
fallon.william.thesis.doc.txt
farley.linda.thesis.doc.txt
feid.000.txt
gisel000.txt
giselle.leal.year.thesis.doc.txt
giummo.joan.unknown.thesis.doc.txt
goff.000.txt
gonzalez.christopher2001.thesis.doc.txt
guerr000.txt
halpe000.txt
harrell.jr.d.fox.2000.txt
hong.000.txt
inoue.hidetaka.thesis.doc.txt
jin.wook.sungpark.moon.2002.thesis.doc.txt
joness.r.john.1992.thesis.txt
jungman.s.erika.2000.thesis.txt
kanatani.tomohiko.thesis.1993.txt
kapla000.txt
kemelson.joel.adam.1991.thesis.txt
kirov000.txt
lawso000.txt
lim.t000.txt
lu.jenjui.unknown.thesis.doc.txt
mangat.chetan.thesis.doc.txt
maras000.txt
marko000.txt
markowitz.c.gary.1994.thesis.txt
marsh000.txt
matt.berger.txt
mcfad000.txt
mcgar000.txt
mcken000.txt
meehan.j.patrick.1994.thesis.txt
megal000.txt
meije000.txt
mizoh000.txt
morley.d.beau.1999.thesis.doc.txt
morri000.txt
murray.m.christina.1994.thesis.txt
ortiz000.txt
palma000.txt
pratt000.txt
psomi000.txt
ramos000.txt
reyes000.txt
roscoe.seidenberg.sheynkman.thesis.doc.2.txt
roscoe.seidenberg.sheynkman.thesis.doc.txt
rosen.zachary.unknown.thesis.doc.txt
ruitg000.txt
ruitg001.txt
russe000.txt
russe001.txt
schul000.txt
schwa000.txt
sealy000.txt
seitz.jr.robert.1988.thesis.doc.txt
sloan.tamara.thesis.doc.txt
smole000.txt
smole001.txt
spieg000.txt
spiegel.steven.1983.thesis.doc.txt
spraf000.txt
spraf001.txt
stack000.txt
stack001.txt
stack002.txt
suiss000.txt
sun.dachen.unknown.thesis.doc.txt
sungp000.txt
talisman.linda.thesis.txt
tan.xiaoli..2005.thesis.doc.txt
taylo000.txt
thomp000.txt
varse000.txt
vasil000.txt
washington.theresa.thesis.doc.txt
weinstein.anat.unknown.thesis.doc.txt
wolf.ahmi.ophra.mogensen.2004.thesis.doc.txt
wu.hsinyi.unknown.thesis.doc.txt
xu.xu000.txt
yen.h000.txt
yeung.marianne.miller.do.2000.thesis.doc.txt
yokoy000.txt
namespace :ocr do
	desc "Remove all documentation.paper values"
	task :remove_papers => :environment do
		@documentations = Documentation.all
		@documentations.each do |documentation|
			documentation.update_attributes(:paper => nil)
		end
	end

	desc "Upload new OCR documents, output to terminal mismatches"
	task :save_texts => :environment do
		@documentations = Documentation.all
		text_list = Dir.glob("../ThesisTXT_round2/*.txt")
		# Strip the directory part so only the file names remain.
		text_names = []
		text_list.each do |text|
			text_array = text.split("/")
			text_names << text_array[2]
		end
		matched = 0
		unmatched = 0
		matched_text_files = []
		unmatched_text_files = []
		matched_pdfs = []
		text_names.each do |text|
			find_match = false
			match = nil
			@documentations.each do |documentation|
				pdf = documentation.media_file_name
				split_pdf = pdf.split(".")
				split_text = text.split(".")
				# Compare last name and first name only; the year check is disabled.
				# if split_text.length >= 3
				if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1] #&& split_pdf[2] == split_text[2]
					matched_pdfs << documentation.id
					find_match = true
					# If several Documentations match, the last one wins.
					match = documentation
				end
				# end
			end
			if find_match
				matched += 1
				matched_text_files << text
				# File.read returns the whole file as one String. readlines keeps each
				# line's trailing newline, so joining with "\n" would double them.
				paper = File.read("../ThesisTXT_round2/#{text}")
				match.update_attributes paper: paper
			else
				unmatched += 1
				unmatched_text_files << text
			end
		end

		# Log the matched text files, preceded by a timestamp.
		puts "MATCHED = #{matched}"
		File.open("#{Rails.root}/log/matched_text_files.txt", "w") do |f|
			f.write(Time.now.to_s + "\n")
			matched_text_files.each do |e|
				f.write(e + "\n")
			end
		end
		puts ""

		# Log the unmatched text files.
		puts "UNMATCHED = #{unmatched}"
		# puts unmatched_text_files
		File.open("#{Rails.root}/log/unmatched_text_files.txt", "w") do |f|
			f.write(Time.now.to_s + "\n")
			unmatched_text_files.each do |e|
				f.write(e + "\n")
			end
		end
		puts ""

		# Log the ids of Documentations that no text file matched.
		unmatched_documentations = []
		@documentations.each do |documentation|
			unless matched_pdfs.include?(documentation.id)
				unmatched_documentations << documentation.id
			end
		end
		File.open("#{Rails.root}/log/unmatched_documentation_ids.txt", "w") do |f|
			f.write(Time.now.to_s + "\n")
			unmatched_documentations.each do |e|
				f.write("#{e}\n")
			end
		end
	end
end