In this document I use bold text to denote the name
There are 1,188 files in the ThesisTXT_round2 folder, and currently 1,094 Thesis records on itparchive.com. This mismatch comes from file names that were not properly parsed by the original upload script.
Each Thesis can have multiple Documentations. A Documentation stores the URL of the PDF, some administrative information about the state of the PDF, and a text field titled paper with the contents of the corresponding .txt file. We will assume that all of the text files from the second round of OCR are better than those from the first round. Some Documentations might not match any of the new text files, so if we relied only on each new text file replacing the contents of paper, a few Documentations would be left with their old paper contents. So the first step is to erase the paper field on every Documentation.
In Terminal, navigate to the Rails project and start up the console (this is an irb session with the code from the Rails project loaded).
(bash)
rails console
Now select all Documentations:
(rails console)
@documentations = Documentation.all
The object @documentations is an Array of Documentation objects. We iterate through every Documentation and update the paper attribute to nil:
(rails console)
@documentations.each do |documentation|
  documentation.update_attributes(:paper => nil)
end
What we just did only changed the development database. The production database, which powers itparchive.com, still has all the junky paper fields. Do everything in development first and then, when it works, push it to production.
When it's time to run this process on production we will write one script to do it all. For now we will continue working in the console to see the output of every command.
Hopefully every .txt file name matches a PDF file name. And hopefully these file names follow a format similar to
LASTNAME.FIRSTNAME.YEAR.thesis.doc.(txt|PDF)
The first three sections, as divided by periods (.), are the important ones: they uniquely identify the file with the person who wrote the paper.
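As a minimal sketch of that naming convention, the snippet below pulls the three identifying parts out of a file name. The helper name `identity_key` is hypothetical and not part of the project, and lowercasing the parts is an assumption made here so that comparisons ignore case.

```ruby
# Hypothetical helper (not from the project): returns
# [lastname, firstname, year] for a file name, or nil when the name
# has fewer than three period-separated parts.
def identity_key(file_name)
  parts = file_name.split(".")
  return nil if parts.length < 3
  # Assumption: lowercase the parts so "PDF"/"pdf" style case
  # differences in names do not prevent a match.
  parts.first(3).map(&:downcase)
end

identity_key("abeson.kristina.1995.thesis.doc.txt")
# => ["abeson", "kristina", "1995"]
identity_key("README")
# => nil
```

Returning nil for short names makes malformed file names easy to spot instead of silently comparing missing parts.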
To load the list of .txt files into the console we run this command:
(rails console)
text_list = Dir.glob("../ThesisTXT_round2/*.txt")
Working from inside /Archive/itparchive, the text files are in the folder /Archive/ThesisTXT_round2/. We are saving the list of file names to the text_list variable. These are just the names of the files, not their contents yet.
The PDF file names are saved in the media_file_name attribute of each Documentation. These next lines will save every file name into an array named pdfs:
(rails console)
pdfs = []
@documentations.each do |documentation|
  pdfs << documentation.media_file_name
end
Now it's time to compare the text names to the PDF names. Let's look at the first element of both text_list and pdfs.
(rails console)
text_list.first
=> "../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt"
pdfs.first
=> "berry.sarah.2000.thesis.doc.pdf"
Notice that in text_list we have some directory information at the front of the string. We don't need that, and in fact it makes it harder to compare with pdfs. Let's get rid of it.
(rails console)
text_names = []
text_list.each do |text|
  text_array = text.split("/")
  text_names << text_array[2]
end
text_names.first
=> "abeson.kristina.1995.thesis.doc.txt"
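Indexing the split array at position 2 only works because the path has exactly two leading segments. A slightly more robust sketch uses Ruby's `File.basename`, which returns the last path component regardless of depth. The one-element list here is a stand-in for the real `text_list`.

```ruby
# Stand-in for the Dir.glob result; the real list has 1,188 entries.
text_list = ["../ThesisTXT_round2/abeson.kristina.1995.thesis.doc.txt"]

# File.basename drops every leading directory component.
text_names = text_list.map { |path| File.basename(path) }

text_names.first
# => "abeson.kristina.1995.thesis.doc.txt"
```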
Much nicer. Now if we split a text_names element and a pdfs element at all the periods and the first 3 parts match, we have a match.
(rails console)
text = text_names.first.split(".")
=> ["abeson", "kristina", "1995", "thesis", "doc", "txt"]
pdf = pdfs.first.split(".")
=> ["berry", "sarah", "2000", "thesis", "doc", "pdf"]
See? Let's test it out.
(rails console)
matched = 0
unmatched = 0
text_names.each do |text|
  find_match = false
  pdfs.each do |pdf|
    split_pdf = pdf.split(".")
    split_text = text.split(".")
    # Compare last name and first name only; the year is not checked here.
    if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1]
      find_match = true
    end
  end
  if find_match
    matched += 1
  else
    unmatched += 1
  end
end
matched
=> 1061
unmatched
=> 127
And 1061 + 127 = 1188, which is great because that is exactly how many text files we have. The number of matches (1061) is also less than the total number of PDFs (1093).
We're looking pretty good. But first let's see which text and PDF files aren't matching up, and reassure ourselves that there aren't duplicates.
(rails console)
pdfs.length
=> 1093
pdfs.uniq.length
=> 1093
So we don't have any duplicate PDF names. Good.
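Note that `uniq` only rules out identical full file names. Because matching compares just the leading name parts, two different PDFs could still share the same parts. Here is a sketch for checking that, using a tiny sample list in which the second and third entries are invented for illustration:

```ruby
# Sample data; only the first name comes from the console output above.
pdfs = [
  "berry.sarah.2000.thesis.doc.pdf",
  "example.person.2001.thesis.doc.pdf",   # invented for illustration
  "example.person.2001.thesis.final.pdf"  # invented for illustration
]

# Group file names by their first three period-separated parts.
by_key = pdfs.group_by { |name| name.split(".").first(3) }

# Keep only the keys shared by more than one file.
duplicate_keys = by_key.select { |_key, names| names.length > 1 }

duplicate_keys.keys
# => [["example", "person", "2001"]]
```

An empty result on the real list would confirm that each person/year combination maps to a single PDF.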
(rails console)
matched = 0
unmatched = 0
matched_text_files = []
unmatched_text_files = []
matched_pdfs = []
text_names.each do |text|
  find_match = false
  pdfs.each do |pdf|
    split_pdf = pdf.split(".")
    split_text = text.split(".")
    # Skip text names that are too short to carry a year.
    if split_text.length >= 3
      # Require last name, first name, and year to all agree.
      if split_pdf[0] == split_text[0] && split_pdf[1] == split_text[1] && split_pdf[2] == split_text[2]
        matched_pdfs << pdf
        find_match = true
      end
    end
  end
  if find_match
    matched += 1
    matched_text_files << text
  else
    unmatched += 1
    unmatched_text_files << text
  end
end
matched
=> 1058
unmatched
=> 130
matched + unmatched
=> 1188
Ok. Checking the year as well (split_text[2]) matches only 1058 files. Dropping the year check, as in the first test, brings the count back to 1061 and saves 3 more .txt files from being orphaned.
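The nested loops above compare every text file against every PDF. As an alternative sketch (not the script that was actually run), the PDFs can be indexed once in a Hash keyed by last and first name, after which each text file needs only one lookup. The arrays are small stand-ins, the text names are invented for illustration, and `name_key` is a hypothetical helper.

```ruby
# Hypothetical helper: the last-name/first-name pair used for matching.
def name_key(file_name)
  file_name.split(".").first(2)
end

# Stand-in data; the text names are invented, the second one to show a miss.
text_names = ["berry.sarah.2000.thesis.doc.txt", "nobody.here.1999.thesis.doc.txt"]
pdfs       = ["berry.sarah.2000.thesis.doc.pdf"]

# Build the lookup table once: key => PDF file name.
# If two PDFs share a key, the later one overwrites the earlier one.
pdf_by_key = pdfs.each_with_object({}) { |pdf, index| index[name_key(pdf)] = pdf }

# partition returns [elements where the block is true, the rest].
matched_text_files, unmatched_text_files =
  text_names.partition { |text| pdf_by_key.key?(name_key(text)) }

matched_text_files.length    # => 1
unmatched_text_files.length  # => 1
```

With roughly a thousand files on each side the nested loops are still fast enough, so this is mainly a readability choice.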
The final script I ran is in /lib/tasks/upload_ocr.rake.
It outputs 3 files: one with the list of all the .txt files that were matched, one with the .txt files that were unmatched, and one with the list of Documentation ids that were not matched.
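The rake task itself is not reproduced here. As a rough sketch of how such report files could be written, the snippet below saves a list with one entry per line. The helper `write_list` and the output file name are assumptions for illustration, not the names used in upload_ocr.rake.

```ruby
require "tmpdir"

# Hypothetical helper: writes one entry per line and returns the path.
def write_list(dir, file_name, entries)
  path = File.join(dir, file_name)
  File.write(path, entries.join("\n") + "\n")
  path
end

# Write into a temporary directory so the sketch leaves nothing behind.
Dir.mktmpdir do |dir|
  path = write_list(dir, "matched_text_files.txt",
                    ["abeson.kristina.1995.thesis.doc.txt"])
  puts File.read(path)
end
```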
There are 34 Documentations in the database without text files. If each of these has a text file that simply didn't match, then there are 103 text files that don't correspond to anything in the database. That could mean duplicates, or Thesis projects that were missed in the initial upload.