@shawngraham
Last active May 11, 2016 19:04
#!/usr/bin/env ruby
# encoding: UTF-8
require 'pp'
require 'csv'
# Patterns that suggest OCR garbage; each one that matches a line adds 1 to the score.
GARBAGE_REGEXEN = {
  'Four Dots' => /\.\.\.\./,
  'Five Non-Alphanumerics' => /\W\W\W\W\W/,
  'Isolated Euro Sign' => /\S€\D/,
  'Double "Low-Nine" Quotes' => /„/,
  'Anomalous Pound Sign' => /£\D/,
  'Caret' => /\^/,
  'Guillemets' => /[«»]/,
  'Double Slashes and Pipes' => /(\\\/)|(\/\\)|([\/\\]\||\|[\/\\])/,
  'Bizarre Capitalization' => /([A-Z][A-Z][a-z][a-z])|([a-z][a-z][A-Z][A-Z])|([A-LN-Z][a-z][A-Z])/,
  'Mixed Alphanumerics' => /(\w[^\s\w\.\-]\w).*(\w[^\s\w]\w)/
}
# Patterns that mark a line as legitimate; a line matching any of them is never penalized.
WHITELIST_REGEXEN = {
  'Four Caps' => /[A-Z]{4,}/,
  'Date' => /Date/,
  'Likely year' => /1[98]\d\d|2[01]\d\d/,
  'N.S.F.' => /N\.S\.F\.|Fund/,
  'Lat Lon' => /Lat|Lon/,
  'Old style Coordinates' => /\d\d°\s?\d\d['’]\s?[NW]/,
  'Old style Minutes' => /\d\d['’]\s?[NW]/,
  'Decimal Coordinates' => /\d\d°\s?[NW]/,
  'Distances' => /\d?\d(\.\d+)?\s?[mkf]/,
  'Caret within heading' => /[NEWS]\^s/,
  'Likely Barcode' => /[l1\|]{5,}/,
  'Blank Line' => /^\s+$/,
  'Guillemets as bad E' => /d«t|pav«aont/
}
module Header
  TERSE_HEADER = "TERSE_FILE"
  NOISY_HEADER = "NOISY_FILE"
end
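# Returns [score, non_blank_lines, total_lines] for the given text file.
# The score counts garbage-pattern matches on lines that match no whitelist pattern.
def calculate_score(filename, negative=false)
  score = 0
  non_blank_lines = 0
  total_lines = 0
  # Read the file as ISO-8859-1, then convert each line to UTF-8 before matching.
  File.readlines(filename, :encoding => 'ISO-8859-1').each do |line|
    line.encode!('UTF-8')
    total_lines += 1
    non_blank_lines += 1 if /\S/ =~ line
    GARBAGE_REGEXEN.each do |name, regex|
      next unless regex =~ line
      # Skip the penalty when any whitelist pattern matches the same line.
      unless WHITELIST_REGEXEN.values.any? { |allowed| allowed =~ line }
        # print "#{filename}: Found #{name} in #{line}!" if negative=='t'
        score += 1
      end
    end
  end
  [score, non_blank_lines, total_lines]
end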
txt_file = ARGV[0] # use the argument as the text file, not a CSV control file
abort "Usage: #{$0} TEXT_FILE" unless txt_file # fail clearly when no file is given
score = calculate_score(txt_file) # actually do the calculation
puts score.join(',') # print the score values to STDOUT, followed by a newline
@shawngraham

Because my source text has some weird encoding issues, lines 45 & 46 of the script (the File.readlines call with :encoding => 'ISO-8859-1' and the line.encode!('UTF-8') that follows it) had to be adapted to cope with it. See the soundbashing project experiment notes at http://smgprojects.github.io. The whitelist is specific to Ben's project (herbarium notes), so it should be adapted to one's own domain.
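
For anyone adapting the script, here is a minimal sketch of the two changes mentioned above: decoding text from a legacy encoding without raising, and swapping in a whitelist for a different domain. Nothing in it comes from the original script. The helper names `to_utf8` and `whitelisted?`, the constant `MY_WHITELIST`, and the invoice-style patterns are all hypothetical, chosen only for illustration.

```ruby
# encoding: UTF-8

# Hypothetical whitelist for a non-herbarium corpus (assumption: invoices).
# Replace these patterns with ones that fit your own material.
MY_WHITELIST = {
  'Currency amount' => /[£€]\s?\d/,
  'Invoice number'  => /INV-\d+/
}

# Hypothetical helper: treat raw bytes as being in source_encoding and
# convert them to UTF-8, substituting '?' for anything that cannot be
# converted instead of raising an exception.
def to_utf8(raw, source_encoding = 'ISO-8859-1')
  raw.dup.force_encoding(source_encoding)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

# Hypothetical helper: true when any pattern in the whitelist matches the line.
def whitelisted?(line, whitelist)
  whitelist.values.any? { |regex| regex =~ line }
end
```

In the script itself, one might pass a different `source_encoding` (for example 'Windows-1252') if the text was not really ISO-8859-1, and substitute the contents of `MY_WHITELIST` for `WHITELIST_REGEXEN`.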
