So today I was experimenting with various languages to make the GHTorrent MySQL "CSV" dumps behave like RFC-compliant CSV files. This involved parsing multi-GB, UTF-8 encoded files and running a small state machine at the character level. I started with Ruby, but it was slow:
$ time ruby csvify.rb projects.csv >/dev/null
real 0m36.714s
user 0m35.689s