Last active December 16, 2015 16:09
Tab-separated parsing using Ruby 2.0 CSV library
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath
  def initialize(filepath)
    @filepath = filepath
  end

  def parse
    File.open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))] # NOTE: the last value keeps its line terminator
        yield fields
      end
    end
  end
end
# Usage: each row is yielded as a Hash keyed by header name
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end
require 'csv'

line = "boogie\ttime\tis \"now\"" # double-quoted so that \t is a real tab
begin
  fields = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end
begin
  fields = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

# Output:
#   failed to parse line
#   parsed correctly with random quote char
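To see what the second call actually returns, here is a minimal sketch. The helper name `parse_tsv_line` and its default quote character are assumptions made for illustration, not part of the gist. The trick only holds as long as the chosen character never occurs in the data.

```ruby
require 'csv'

# Hypothetical helper (not part of the gist): parse one TSV line while
# treating double quotes as ordinary data. The default quote_char is an
# assumption: any single character that never occurs in the data will do.
def parse_tsv_line(line, quote_char: "\u0182")
  CSV.parse_line(line.chomp, col_sep: "\t", quote_char: quote_char)
end

fields = parse_tsv_line("boogie\ttime\tis \"now\"\n")
p fields # => ["boogie", "time", "is \"now\""]
```

The embedded double quotes survive untouched in the third field, which is the behaviour a "strict" TSV reader wants.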
Line 12 of strict_tsv.rb needs to use line.strip.split to remove the line terminator from the last column in each row.
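A small sketch of the difference, in case it helps. The helper `split_tsv_line` is a made-up name for illustration. It uses `chomp` rather than `strip`, on the assumption that leading or trailing tabs mark empty columns that should be kept, and passes `-1` to `split` so trailing empty columns are not dropped.

```ruby
# Hypothetical helper (not in the gist): split one raw line into its
# tab-separated values without losing empty columns.
def split_tsv_line(line)
  # chomp removes only the line terminator; -1 keeps trailing empty fields
  line.chomp.split("\t", -1)
end

with_newline = "a\tb\tc\n".split("\t")       # the terminator stays on the last value
stripped     = "a\tb\t\n".strip.split("\t")  # strip also eats the trailing tab
chomped      = split_tsv_line("a\tb\t\n")    # keeps the empty third column

p with_newline # => ["a", "b", "c\n"]
p stripped     # => ["a", "b"]
p chomped      # => ["a", "b", ""]
```

Whether `strip` or `chomp` is the better fit depends on whether empty leading or trailing columns are meaningful in your files.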
We needed to parse a huge file and, inspired by this snippet, wrote a somewhat more complex solution that can switch headers on or off and gives access to rows in both array-like and hash-like ways. I'd appreciate it if you could take a look at it and possibly provide feedback.
Thanks,
Slotos
@Slotos your gem looks really helpful. I'm glad to have contributed in some way to it.
I found that on line 11 of tsv_example.rb I actually have to use something like quote_char: "\u02f6" rather than a literal weird character, or Ruby complains about an unexpected $end at that location. I don't understand why, but it works.
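One possible explanation, offered as a guess rather than a diagnosis: Ruby 1.9 reads source files as US-ASCII unless a magic encoding comment says otherwise, so a literal non-ASCII character can confuse the parser. Ruby 2.0 and later assume UTF-8 source by default. The sketch below shows both workarounds. It assumes nothing about the original setup beyond that, and the sample line is invented.

```ruby
# encoding: utf-8
# The magic comment above must be the first line of the file (or the second,
# after a shebang). It only matters on Ruby 1.9; newer versions default to UTF-8.
require 'csv'

# Escape form: the source file stays pure ASCII, so the source encoding
# no longer matters. Any character absent from the data would work.
quote_char = "\u02f6"

fields = CSV.parse_line("a\tsaid \"hi\"", col_sep: "\t", quote_char: quote_char)
p fields # => ["a", "said \"hi\""]
```

The escape form is the more portable of the two, since it does not depend on how the file itself is saved or read.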