@parshap
Last active January 1, 2016 16:49
Ruby CSV Parsing

CSVParser

Parse CSV rows by defining parsing blocks for individual columns.

Each row is parsed one-by-one. First a new Hash is initialized to store data for the row. Then, each individual column is parsed by calling matching parsing blocks. Parsing blocks are passed the column's value and header key and can set arbitrary state on the Hash for the current row.
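
The per-row flow described above can be sketched without the class machinery. This is a minimal illustration, not the gist's implementation: `parse_row` and the `parsers` list are hypothetical names, and the row Hash is passed to each block explicitly instead of being reached through `self`.

```ruby
# Illustrative only: `parse_row` and `parsers` are hypothetical names.
def parse_row(headers, values, parsers)
  row = {}                                   # a fresh Hash for every row
  values.each_with_index do |val, i|
    key = headers[i].to_s.strip              # header for this column
    parsers.each do |criteria, block|
      # Call every block whose criteria matches the column's header
      block.call(row, val.to_s.strip, key) if criteria === key
    end
  end
  row
end

parsers = [
  ["Name",   ->(row, val, _key) { row[:name] = val }],
  [/email/i, ->(row, val, _key) { (row[:emails] ||= []) << val }],
]

headers = ["Name", "Work Email", "Home Email"]
result = parse_row(headers, [" Ada ", "ada@example.com", "ada@home.example"], parsers)
```

Here `result[:name]` holds the stripped name, and both email columns end up in `result[:emails]` because the Regexp matches two headers.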

Usage

Example:

class MyParser < CSVParser
  parse "Name" do |val|
    self[:name] = val
  end
end

# CSVParser#initialize wraps its argument in CSV.new, so pass an IO or a String
parser = MyParser.new File.open("data.csv")
parser.each do |row|
  puts row[:name]
end

Defining Parsers

Parsing blocks are added using the CSVParser.parse class method. Its first parameter, criteria, determines whether the block is executed for a particular column: the block runs when criteria === header is true, so a String matches one exact header and a Regexp matches every header it matches. An optional second parameter, params, takes extra options. The block is passed the column value and its associated header value, and it can update the values for the current row by using self as a Hash.

Column and header values are always converted to strings and stripped of whitespace first.

class MyParser < CSVParser
  parse /^(first|last)?\W*name$/i do |val|
    self[:name] = val.capitalize
  end
end
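
Since matching relies on `===`, the kind of object used as the criteria decides how a header is compared. The snippet below shows plain Ruby behaviour for three criteria types; the Proc criteria is an illustration of what `===` permits, not something the README itself uses.

```ruby
header = "First Name"

# String#=== is plain equality, so only the exact header matches
string_match = "First Name" === header

# Regexp#=== is true when the pattern matches the header
regexp_match = /^(first|last)?\W*name$/i === header

# Proc#=== calls the proc with the header
proc_match = ->(h) { h.end_with?("Name") } === header

# Order matters: the criteria must be the receiver of ===
reversed = header === /^(first|last)?\W*name$/i
```

The first three comparisons are true, while `reversed` is false because `String#===` never consults the Regexp.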

Once Parsers

Using CSVParser.parse_once, you can define parsers that execute at most once per row. In the example above, had parse_once been used, the block would run only for the first matching column of each row, even if the row contains several name columns.
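
A rough sketch of the once-per-row bookkeeping follows. The `parser` Hash, `executed`, and `calls` are illustrative names chosen for this example; the idea is only that a list of already-run parsers is reset for every row and checked before each call.

```ruby
# Hypothetical stand-alone sketch; names are illustrative.
parser = { criteria: /name$/i, once: true }
calls = []                          # records which headers triggered the block

headers = ["First Name", "Last Name", "Email"]
executed = []                       # reset at the start of every row
headers.each do |key|
  next unless parser[:criteria] === key
  next if parser[:once] && executed.include?(parser)   # already ran for this row
  calls << key
  executed << parser
end
```

Only "First Name" ends up in `calls`: the first matching column wins and later matches in the same row are skipped.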

Default Row Values

The CSVParser#defaults method is used to generate a hash to use for each row. You can use this to set default values.

class MyParser < CSVParser
  def defaults
    { name: "User", email: [] }
  end
end
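
One reason for generating the defaults in a method is that each call builds new objects, so mutable values such as the `email` Array are not shared between rows. The sketch below contrasts that with a shared template Hash; `shared` is a hypothetical variable used only for the comparison.

```ruby
# Each call returns a brand-new Hash containing a brand-new Array
def defaults
  { name: "User", email: [] }
end

first  = defaults
second = defaults
first[:email] << "a@example.com"    # does not affect `second`

# A shared template leaks state: dup is shallow, so the inner Array is reused
shared = { name: "User", email: [] }
row_a = shared.dup
row_a[:email] << "a@example.com"
row_b = shared.dup                  # row_b[:email] already has one entry
```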

Questions

Setting values in blocks

Is the #[]= method to allow self[:something] inside the blocks good? Is it weird setting state that way? Is there a better way?

Should I maybe do something like the following so that each row's data is explicitly in scope and I don't have the #[]= magic?

class MyParser < CSVParser
  row do |data|
    parse "Name" do |val|
      data[:name] = val
    end
  end
end

Protected & Private

Are my uses correct? #[]= is protected because it needs to be called with an explicit receiver (self) but the rest of the methods are only used internally with the implicit receiver, so I think they should be private?
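
To make the visibility question concrete, here is a small sketch using a hypothetical `Row` class. A protected method can be called with an explicit receiver as long as the caller is an instance of the same class (or a subclass), but not from outside. As far as I know, Ruby 2.7 and later also allow private methods to be called with a literal `self` receiver, so on those versions private would work here too.

```ruby
# Hypothetical class used only to demonstrate method visibility
class Row
  def initialize
    @attributes = {}
  end

  def fill
    self[:name] = "Ada"   # explicit `self` receiver: allowed for protected methods
    self[:name]
  end

  protected

  def [](name)
    @attributes[name]
  end

  def []=(name, val)
    @attributes[name] = val
  end
end

row = Row.new
inside = row.fill

# Calling the protected reader from outside the class raises NoMethodError
outside_error = begin
  row[:name]
  nil
rescue NoMethodError => e
  e
end
```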

Instance-based parsers instead of class-based

Would it be better to define parsers by creating new instances, instead of having to define classes? For example, the API might alternatively look something like:

parser = CSVParser.new(CSV.open "data.csv") do
  defaults { { name: "User", email: [] } }

  parse "Name" do |val|
    self[:name] = val.capitalize
  end
end

parser.each do |row|
  # ...
end
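
One possible shape for an instance-based parser is sketched below, assuming the configuration block is run with `instance_eval`. `InstanceParser` is a hypothetical class, it takes already-split rows (for example from `CSV.read`) instead of a CSV object, and its blocks receive the row Hash explicitly rather than writing through `self[...]`.

```ruby
# Hypothetical InstanceParser: illustrates the block-based API only
class InstanceParser
  include Enumerable

  # rows: an Array of Arrays whose first element is the header row
  def initialize(rows, &config)
    @rows = rows
    @parsers = []
    @defaults = -> { {} }
    instance_eval(&config) if config   # run the DSL block with self = this parser
  end

  # Register the block that builds the default Hash for each row
  def defaults(&block)
    @defaults = block
  end

  # Register a column parser
  def parse(criteria, &block)
    @parsers << [criteria, block]
  end

  def each
    headers, *rows = @rows
    rows.each do |row|
      attrs = @defaults.call
      row.each_with_index do |val, i|
        @parsers.each do |criteria, block|
          block.call(attrs, val.to_s.strip) if criteria === headers[i].to_s.strip
        end
      end
      yield attrs
    end
  end
end

data = [["Name", "Email"], ["ada", "ada@example.com"], ["grace", "grace@example.com"]]
parser = InstanceParser.new(data) do
  defaults { { name: "User", email: [] } }
  parse("Name") { |row, val| row[:name] = val.capitalize }
end

names = parser.map { |row| row[:name] }
```

Because the parsers live in instance variables, two parser instances never share their column definitions.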

Yield instead of returning?

Should I yield the parser instance to a block instead of directly returning it? I guess this doesn't make too much sense for CSVParser.new but maybe if there was a CSVParser.open class method? Or does it?
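
For comparison, the pattern that `File.open` follows is to yield the object to a block, return the block's value, and close the underlying IO afterwards. A sketch of that pattern, using a hypothetical `LineReader` class as a stand-in for the parser:

```ruby
require "stringio"

# Hypothetical LineReader standing in for a parser that owns an IO
class LineReader
  def self.open(io)
    reader = new(io)
    return reader unless block_given?   # no block: the caller must close the IO
    begin
      yield reader                      # block form: the block's value is returned
    ensure
      io.close                          # always runs, even if the block raises
    end
  end

  def initialize(io)
    @io = io
  end

  def lines
    @io.each_line.map(&:chomp)
  end
end

io = StringIO.new("Name\nAda\n")
result = LineReader.open(io) { |reader| reader.lines }
```

The block form mainly pays off when the class opens the resource itself, which is why it fits an `open` class method better than `new`.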

require 'csv'

class CSVParser
  class << self
    # NOTE: a class variable is shared by CSVParser and all of its subclasses,
    # so parsers registered in one subclass also run in every other subclass.
    @@parsers = []

    private

    # Add a column parser
    def parse(criteria, params={}, &block)
      @@parsers << {
        criteria: criteria,
        block: block,
      }.merge(params)
    end

    # Add a parser that will only get called once per row
    def parse_once(criteria, params={}, &block)
      # Forward the block with & so it is not passed as a positional argument
      parse(criteria, { once: true }.merge(params), &block)
    end
  end

  def initialize(data, options={})
    # Headers are read manually in #each, so disable CSV's own header handling.
    # The double splat passes the options as keyword arguments.
    @csv = CSV.new(data, **options.merge(headers: false))
  end

  include Enumerable

  def each
    # Get header values used later in parsing
    @headers = @csv.shift.map(&:to_s).map(&:strip)

    # Parse each row
    @csv.each do |row|
      yield parse_row row
    end
  end

  private

  def parse_row(row)
    # Create a new attributes hash for this row, this will be our result
    @attributes = defaults

    # Keep track of which parsers have already been executed for this row
    @executed = []

    # Parse each column of the row
    row.each_with_index do |val, i|
      parse_val val.to_s.strip, @headers[i]
    end

    # Return the attributes that were built using #[]=
    @attributes
  end

  # Parse a column value
  def parse_val(val, key)
    @@parsers.each do |parser|
      # Execute any parsers that match this column. `!` is used instead of
      # `not` because `not a && b` parses as `not (a && b)`.
      if !onced?(parser) && match?(parser, val, key)
        instance_exec val, key, &parser[:block]
        @executed << parser
      end
    end
  end

  # Is the parser a once parser and has already been executed for this row?
  def onced?(parser)
    parser[:once] && @executed.include?(parser)
  end

  # Does the parser criteria match the column?
  def match?(parser, val, key)
    parser[:criteria] === key
  end

  # Default hash values to use for each row
  def defaults
    Hash.new
  end

  protected

  def [](name)
    @attributes[name]
  end

  def []=(name, val)
    @attributes[name] = val
  end
end

class ZillowContactParser < CSVParser
  parse_once "Name" do |val|
    first_name, last_name = val.split(nil, 2)
    # @TODO need `to_s`?
    self[:contact][:first_name] = first_name
    self[:contact][:last_name] = last_name
  end

  parse_once "Search Timeframe" do |val|
    self[:contact][:timeframe] = val
  end

  parse_once "Email (Personal) #1" do |val|
    self[:contact][:email] = split(val).join ","
  end

  parse_once "Contact Type" do |val|
    self[:contact_types] << val
  end

  # Parentheses are escaped so they match literally instead of forming a group
  parse_once(/^Phone \(Mobile\) #\d$/) do |val|
    self[:phone_numbers] << {
      label: "Cell",
      number: val,
    }
  end

  # Notes
  [
    "Note",
    "Home Type",
    "Latest Communication",
    /^Listing #\d$/,
  ].each do |name|
    parse name do |val|
      # @TODO Include label somehow?
      self[:notes] << {
        content: val
      }
    end
  end

  # Property Search
  parse_once "Min. Price" do |val|
    self[:property_search][:price_low] = val
  end

  parse_once "Max. Price" do |val|
    self[:property_search][:price_high] = val
  end

  parse(/^Location #\d$/) do |val|
    self[:property_search][:misc_locations] << {
      name: "Other #{val}",
      location_value: val,
    }
  end

  private

  def split(s)
    s.split(/[\s*,;]/).map(&:strip).reject(&:empty?)
  end

  def defaults
    {
      # The contact parsers above write into this Hash, so it must exist
      contact: {},
      notes: [],
      contact_types: [],
      phone_numbers: [],
      property_search: {
        misc_locations: [],
      },
    }
  end
end