@Bajena
Created January 29, 2020 06:46
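require "open-uri"
require "zlib"

class Loader
  # Returns an Enumerator that yields one preprocessed row at a time.
  def load
    Enumerator.new { |main_enum| stream(main_enum) }
  end

  private

  # Reads the gzipped file line by line, skipping the header row.
  def stream(main_enum)
    reader = nil
    file_uri.open do |file|
      reader = Zlib::GzipReader.new(file)
      reader.each_line.lazy.drop(1).each do |line|
        main_enum << preprocess_row(line)
      end
    end
  ensure
    reader&.close
  end

  def file_uri
    URI.parse("ftp://user:password@host.com/file.csv.gz")
  end

  # Naive split: strips every quote character, then splits on commas.
  def preprocess_row(row)
    row.chomp.gsub('"', "").split(",")
  end
end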
@SampsonCrowley

Why are you stripping out the main quote character for CSVs? This is absolutely the wrong way to parse CSV data for anything except the most basic input.

"single, column value containing quotes "" , still in first col",second column,"third"

is going to be parsed completely incorrectly by your preprocess_row function.

The correct output would be:

[
  'single, column value containing quotes " , still in first col',
  'second column',
  'third'
]

What you're going to get:

[
  'single',
  ' column value containing quotes  ',
  ' still in first col',
  'second column',
  'third'
]
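
To make the difference concrete, here is a minimal sketch that runs the example line through both approaches. `naive_split` is a hypothetical helper that mirrors the gist's `preprocess_row`; `CSV.parse_line` comes from Ruby's standard `csv` library.

```ruby
require "csv"

line = %q("single, column value containing quotes "" , still in first col",second column,"third")

# Mirrors the gist's preprocess_row: strip every quote, then split on commas.
def naive_split(row)
  row.chomp.gsub('"', "").split(",")
end

naive  = naive_split(line)      # quotes are gone, so every comma splits a field
parsed = CSV.parse_line(line)   # quoted commas and "" escapes are honored

puts naive.length   # 5
puts parsed.length  # 3
```

The naive version produces five fields because the commas inside the quoted first column are treated as separators once the quotes are stripped.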

@SampsonCrowley

To build a proper streaming CSV parser, you would open an IO object, pass it into CSV.foreach, and then feed each line into the IO.
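
One way this could look, as a rough sketch rather than a drop-in replacement for the gist: the decompressing reader is handed to the CSV parser, which pulls data from it as rows are requested. The sketch uses `CSV.new(io)` instead of `CSV.foreach`, since `CSV.new` is the call known to accept an IO object. `sample_gzip_io` and `each_csv_row` are hypothetical names, and the in-memory gzip data is an assumption standing in for the remote file.

```ruby
require "csv"
require "zlib"
require "stringio"

# Stand-in for the remote file: a gzip-compressed CSV held in memory.
def sample_gzip_io
  buffer = StringIO.new("".b)
  gz = Zlib::GzipWriter.new(buffer)
  gz.write(%Q(name,note\n"Doe, Jane","said ""hi"""\nBob,plain\n))
  gz.finish # flush the gzip stream without closing the buffer
  StringIO.new(buffer.string)
end

# Yields each parsed row as an array of fields, skipping the header row.
# Without a block it returns an Enumerator, like Loader#load in the gist.
def each_csv_row(io)
  return enum_for(:each_csv_row, io) unless block_given?

  reader = Zlib::GzipReader.new(io)
  begin
    CSV.new(reader, headers: true).each { |row| yield row.fields }
  ensure
    reader.close
  end
end

rows = each_csv_row(sample_gzip_io).to_a
rows.each { |fields| p fields }
```

With `headers: true` the first line is consumed as the header, which replaces the `drop(1)` call in the original code.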

@SampsonCrowley

What about CSVs containing quoted newlines, nested quotes, etc.?
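
A small sketch of why this matters for any `each_line`-based reader: a quoted newline splits one record across two physical lines, while a CSV-aware reader keeps it together. The sample data is made up for illustration.

```ruby
require "csv"
require "stringio"

data = %Q(id,comment\n1,"first line\nsecond line"\n2,"nested ""quotes"" here"\n)

# Line-based reading sees four physical lines, two of them half-records.
physical_lines = StringIO.new(data).each_line.to_a

# CSV-aware reading sees three logical rows (the header plus two records).
logical_rows = CSV.new(StringIO.new(data)).to_a

puts physical_lines.length  # 4
puts logical_rows.length    # 3
```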
