Created March 12, 2015 20:26
Read two files and merge data with Ruby. This script came about while creating document IDs in a system. The IDs consist of 4 characters and contain only lowercase letters a-z. I had created a list of so-called "dirty words" used to filter Solr suggestions and also had a legacy list of 4-letter "dirty words". I was tasked to create a master list of…
legacy = 'legacy-dirt.txt'
solr = 'solr-dirt.txt'
dirt = 'dirt.txt'
dirt_final = 'dirt-final.txt'

# Read both text files into arrays. Each element keeps its trailing newline.
legacy_file_array = File.readlines(legacy)
solr_file_array = File.readlines(solr)

# Create an empty dirt.txt (truncating any previous run) and write the merged list.
# The block form closes the file, so all writes are flushed before it is read back.
File.open(dirt, 'w') do |dirt_file|
  # Add all of the solr dirt.
  solr_file_array.each do |word|
    dirt_file.write(word)
  end

  # Add legacy dirt not existing in solr dirt to dirt.txt
  legacy_file_array.each do |word|
    unless solr_file_array.include?(word)
      dirt_file.write(word)
    end
  end
end

# Read the merged list back so it can be filtered and sorted.
dirt_file_array = File.readlines(dirt)

# Create a dirt array containing only 4 character words consisting of only lowercase letters.
# The pattern includes the newline that readlines leaves at the end of each line,
# so a final line with no trailing newline is skipped.
dirt_of_4_letter_array = []
dirt_file_array.each do |word|
  if word =~ /\A[a-z]{4}\n\z/
    dirt_of_4_letter_array.push(word)
  end
end

# Remove duplicates and sort array.
dirt_of_4_letter_array = dirt_of_4_letter_array.uniq.sort

# Each word still ends in a newline, so joining with '' yields one word per line.
File.open(dirt_final, 'w') { |file| file.write(dirt_of_4_letter_array.join('')) }
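
The same merge can probably be done without the intermediate dirt.txt file. The following is only a minimal sketch, not part of the original script: `merge_dirt` is a hypothetical helper name, and it assumes both lists fit comfortably in memory. It uses Ruby's array union (`|`) to combine the lists, strips newlines with `chomp`, and keeps only words made of exactly four lowercase letters.

```ruby
# Hypothetical helper (not in the original gist): merge two word lists and
# keep only words made of exactly four lowercase a-z letters.
def merge_dirt(solr_words, legacy_words)
  (solr_words | legacy_words)                       # union of both lists
    .map(&:chomp)                                   # drop trailing newlines
    .select { |word| word.match?(/\A[a-z]{4}\z/) }  # exactly 4 lowercase letters
    .uniq                                           # remove duplicates left after chomp
    .sort
end

# In-memory example data, standing in for the lines of the two input files.
solr_words = ["darn\n", "heck\n", "toolong\n"]
legacy_words = ["heck\n", "drat\n", "Oops\n", "abc\n"]

puts merge_dirt(solr_words, legacy_words)
```

To use it with the files from the script above, the arrays returned by `File.readlines(solr)` and `File.readlines(legacy)` could be passed in, and the result written out with `File.write(dirt_final, result.join("\n") + "\n")`. Because the words are chomped first, a final line without a trailing newline is handled the same way as every other line.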