@wsams
Created March 12, 2015 20:26
Read two files and merge data with Ruby. This script came about while creating document IDs in a system. The IDs consist of 4 characters and contain only the lowercase letters a-z. I had created a list of so-called "dirty words" used to filter Solr suggestions and also had a legacy list of 4-letter "dirty words". I was tasked to create a master list of…
legacy = 'legacy-dirt.txt'
solr = 'solr-dirt.txt'
dirt = 'dirt.txt'
dirt_final = 'dirt-final.txt'
# Read both text files into arrays, stripping the trailing line endings.
legacy_file_array = File.readlines(legacy).map(&:chomp)
solr_file_array = File.readlines(solr).map(&:chomp)
# Create (or truncate) dirt.txt. The block form closes the file, so every
# write is flushed to disk before the file is read back later on.
File.open(dirt, 'w') do |dirt_file|
  # Add all of the solr dirt, one word per line.
  solr_file_array.each do |word|
    dirt_file.puts(word)
  end
  # Add legacy dirt not existing in solr dirt to dirt.txt
  legacy_file_array.each do |word|
    dirt_file.puts(word) unless solr_file_array.include?(word)
  end
end
# Read the merged list back, stripping line endings.
dirt_file_array = File.readlines(dirt).map(&:chomp)
# Keep only words of exactly 4 lowercase a-z letters.
# \A and \z anchor the whole string; ^ and $ only anchor a single line.
dirt_of_4_letter_array = dirt_file_array.select { |word| word =~ /\A[a-z]{4}\z/ }
# Remove duplicates and sort array.
dirt_of_4_letter_array = dirt_of_4_letter_array.uniq.sort
# Write the final list to dirt-final.txt, one word per line.
File.open(dirt_final, 'w') do |file|
  file.write(dirt_of_4_letter_array.map { |word| "#{word}\n" }.join)
end
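The same merge can probably be done in memory, without the intermediate dirt.txt file. The sketch below is only an illustration under that assumption: the helper name `merge_four_letter_words` and the sample words are hypothetical and not part of the original script. It relies on `Array#|` (set union) to combine the two lists.

```ruby
# Hypothetical helper, not part of the original script: merge two word
# lists and return the unique 4-letter lowercase a-z words, sorted.
def merge_four_letter_words(solr_words, legacy_words)
  # Array#| is a set union: every solr word plus any legacy word not already present.
  combined = solr_words.map(&:chomp) | legacy_words.map(&:chomp)
  # Keep only words made of exactly four lowercase a-z letters.
  combined.select { |word| word.match?(/\A[a-z]{4}\z/) }.uniq.sort
end

# Sample input shaped like File.readlines output (each entry ends in a newline).
solr_sample = ["darn\n", "heck\n", "Oops\n"]
legacy_sample = ["heck\n", "blah\n", "toolong\n"]
puts merge_four_letter_words(solr_sample, legacy_sample)
```

With real data the two arguments would be `File.readlines(solr)` and `File.readlines(legacy)`, and the result could be written to `dirt_final` one word per line.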