@TehShrike
Created March 13, 2024 15:33
parse a csv line
import test from 'lib/tester';

const parse_csv_line = (line: string): string[] => {
	const chunks: string[] = [];
	let inside_quotes = false;
	let current_chunk = '';

	for (let i = 0; i < line.length; ++i) {
		const char = line[i];

		if (char === ',' && !inside_quotes) {
			chunks.push(current_chunk);
			current_chunk = '';
		} else if (char !== '"') {
			current_chunk += char;
		} else {
			const is_escaping_a_quote = line.length > i + 1 && line[i + 1] === '"';
			if (is_escaping_a_quote) {
				current_chunk += '"';
				i++;
			} else {
				inside_quotes = !inside_quotes;
			}
		}
	}

	chunks.push(current_chunk);

	return chunks;
};

test('parse some csv lines', (assert) => {
	assert.equal(parse_csv_line(``), ['']);
	assert.equal(parse_csv_line(`greetings`), ['greetings']);
	assert.equal(parse_csv_line(`greetings,earthling`), ['greetings', 'earthling']);
	assert.equal(parse_csv_line(`"greetings,earthling"`), ['greetings,earthling']);
	assert.equal(parse_csv_line(`greetings,"bob, billy, both earthlings"`), [
		'greetings',
		'bob, billy, both earthlings',
	]);
	assert.equal(parse_csv_line(`greetings,"bob, billy, both ""earthlings"""`), [
		'greetings',
		'bob, billy, both "earthlings"',
	]);
	assert.equal(parse_csv_line(`greetings,"bob, billy, both ""earthlings""",I come in peace`), [
		'greetings',
		'bob, billy, both "earthlings"',
		'I come in peace',
	]);
});

@Conduitry

In case you're interested, I've made some further adjustments to this as I converted it into something that operates on a readable web stream.

This code doesn't properly handle a line such as a,b,"",d where an empty column is (unnecessarily) written as "" rather than simply left empty. The code above parses that line as ['a', 'b', '"', 'd']: it sees the first quote, peeks ahead, sees that the next character is also a quote, and interprets the pair as an escaped quote. I believe "" should only be interpreted as " when the first " makes you exit quoted mode and the second one makes you enter it again.

Relatedly, when the input is a stream, there's a very annoying edge case to worry about: a chunk boundary might fall between those two quotes, and then it's not necessarily easy to look at what the next character is.
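To make the chunk-boundary problem concrete, here is a small illustration. It assumes a hypothetical stream that happens to split one line into two string chunks. `is_escaping_a_quote` is just the lookahead check from the gist pulled out into a helper, and the sample line is made up for the example.

```typescript
// Hypothetical helper: the gist's lookahead check, applied to a single chunk of text.
const is_escaping_a_quote = (text: string, i: number): boolean =>
	text.length > i + 1 && text.charAt(i + 1) === '"';

// A made-up line whose second field contains escaped quotes.
const sample_line = `x,"say ""hi"""`;

// Simulate a stream that splits the line right between the two quotes of a `""` pair.
const first_chunk = sample_line.slice(0, 8); // x,"say "
const second_chunk = sample_line.slice(8); // "hi"""

// With the whole line in hand, the quote at index 7 can see the quote that follows it.
const whole_line_lookahead = is_escaping_a_quote(sample_line, 7); // true

// With only the first chunk, the following quote lives in second_chunk, so the check fails.
const split_lookahead = is_escaping_a_quote(first_chunk, 7); // false
```

The same quote character gets two different answers depending on where the chunk boundary lands, which is why a check that only looks backward is easier to carry across chunks.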

Putting these together, I made the following adjustments. Encountering " now always toggles inside_quotes. (This is safe because the second " of an escaped pair puts you straight back into quoted mode anyway.) It also lets me base the check on the previous character in the string, which I'm guaranteed to always have available, rather than on the next one. The replacement for the char === '"' branch (the final else) in your code above would then be something like:

inside_quotes = !inside_quotes;
if (inside_quotes && line[i - 1] === '"') {
	current_chunk += '"';
}
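For reference, here is a minimal sketch of how that branch might fit into a complete parser that can also be fed one line in several pieces. It is not the actual stream-based code mentioned above: `create_csv_line_parser`, `feed`, `finish`, and `previous_char` are hypothetical names, the variables are renamed to `fields` and `current_field` to avoid confusion with stream chunks, and row splitting on newlines is left out entirely.

```typescript
type CsvLineParser = {
	feed: (text: string) => void;
	finish: () => string[];
};

// Hypothetical factory: parses a single CSV line that may arrive in several pieces.
const create_csv_line_parser = (): CsvLineParser => {
	const fields: string[] = [];
	let inside_quotes = false;
	let current_field = '';
	// Remembered across feed() calls, so a `""` pair split over two chunks still works.
	let previous_char = '';

	const feed = (text: string): void => {
		for (let i = 0; i < text.length; ++i) {
			const char = text.charAt(i);

			if (char === '"') {
				// Every quote toggles quoted mode; re-entering it right after
				// another quote means the pair was an escaped quote.
				inside_quotes = !inside_quotes;
				if (inside_quotes && previous_char === '"') {
					current_field += '"';
				}
			} else if (char === ',' && !inside_quotes) {
				fields.push(current_field);
				current_field = '';
			} else {
				current_field += char;
			}

			previous_char = char;
		}
	};

	// Returns every completed field plus the one still being built.
	const finish = (): string[] => [...fields, current_field];

	return { feed, finish };
};

// Whole-line convenience wrapper with the same signature as the gist's function.
const parse_csv_line = (line: string): string[] => {
	const parser = create_csv_line_parser();
	parser.feed(line);
	return parser.finish();
};
```

With this sketch, parse_csv_line(`a,b,"",d`) gives ['a', 'b', '', 'd'], and feeding `x,"say "` followed by `"hi"""` to a single parser gives ['x', 'say "hi"'] even though the boundary falls inside an escaped pair.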
