@zacharysyoung
Last active November 17, 2023 20:22
To Go's encoding/csv: let my data be.

Let my data be

Go's encoding/csv Reader type takes the novel (to me) approach of deciding that carriage return line feeds (CRLFs) should be replaced with newlines (LFs).

It replaces not only the CRLFs that mark the end of one record and the beginning of the next (the encoding of the data), but every CRLF at the end of any line of text, including those inside the data itself.

The CSV:

ID,Data
1,"Let\r\nmy\ndata\rbe"
2,"foo\r\nbar\nbaz"

will be read as:

ID,Data
1,"Let\nmy\ndata\rbe"
2,"foo\nbar\nbaz"

The data (between the quotes) Let\r\nmy\ndata\rbe and foo\r\nbar\nbaz will be transformed to Let\nmy\ndata\rbe and foo\nbar\nbaz. The CRs in the CRLFs between "Let" and "my", and between "foo" and "bar", have been dropped, leaving only the LFs. The lone CR between "data" and "be" was left alone.
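To see this for yourself, here is a minimal sketch that feeds the same two records to a csv.Reader with default settings and prints what comes back. The helper name readAll is my own, made up for this example, and the escapes in the Go string literal stand for real CR and LF bytes inside the quoted fields.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readAll is a hypothetical helper for this sketch: it parses src with a
// default csv.Reader and returns every record.
func readAll(src string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(src)).ReadAll()
}

func main() {
	// The two records from the example, with real CR and LF bytes
	// inside the quoted fields.
	src := "ID,Data\n" +
		"1,\"Let\r\nmy\ndata\rbe\"\n" +
		"2,\"foo\r\nbar\nbaz\"\n"

	records, err := readAll(src)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for _, rec := range records {
		// %q makes CR and LF visible as \r and \n.
		fmt.Printf("%q\n", rec)
	}
	// Prints:
	// ["ID" "Data"]
	// ["1" "Let\nmy\ndata\rbe"]
	// ["2" "foo\nbar\nbaz"]
}
```

Only the CRs that sat directly before an LF are gone; the lone CR in "data\rbe" survives.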

As stated in the documentation:

Carriage returns before newline characters are silently removed.

and the code bears this out:

// Normalize \r\n to \n on all input lines.
if n := len(line); n >= 2 && line[n-2] == '\r' && line[n-1] == '\n' {
	line[n-2] = '\n'
	line = line[:n-1]
}

A comment for the Reader justifies this behavior:

// The Reader converts all \r\n sequences in its input to plain \n,
// including in multiline field values, so that the returned data does
// not depend on which line-ending convention an input file uses.

The csv package uses the term field in this context to refer to the data, so the comment shows that Go knows it is transforming the data itself and not just the encoding around the data.
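The line-level check quoted above can be pulled out and run on its own to show why a CRLF inside a quoted field is rewritten while a lone CR is not. This is only a sketch: normalizeLine is a name I made up, and splitting on LF with bufio is my assumption about how the input gets cut into lines, not a copy of the Reader's internals.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// normalizeLine is a hypothetical stand-in that applies the quoted check:
// a line ending in \r\n has that pair collapsed to a single \n.
// It edits the slice in place, like the original snippet.
func normalizeLine(line []byte) []byte {
	if n := len(line); n >= 2 && line[n-2] == '\r' && line[n-1] == '\n' {
		line[n-2] = '\n'
		line = line[:n-1]
	}
	return line
}

func main() {
	// One record whose quoted field spans three physical lines.
	input := "1,\"Let\r\nmy\ndata\rbe\"\n"

	r := bufio.NewReader(strings.NewReader(input))
	for {
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			before := string(line) // copy first: normalizeLine mutates line
			after := normalizeLine(line)
			fmt.Printf("%q -> %q\n", before, after)
		}
		if err != nil {
			break // io.EOF ends the loop
		}
	}
	// Prints:
	// "1,\"Let\r\n" -> "1,\"Let\n"
	// "my\n" -> "my\n"
	// "data\rbe\"\n" -> "data\rbe\"\n"
}
```

The check runs on every physical line before any quote handling, so it cannot tell a record terminator from a CRLF that is part of a field. A CR in the middle of a line never matches, which is why "data\rbe" passes through untouched.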

For comparison, imagine encoding/json doing the same:

{"1": "Let\r\nmy\ndata\rbe"}

becomes:

{"1": "Let\nmy\ndata\rbe"}

so as to save people/data from a "certain line-ending dependency".

Even more odd, Go's csv.Writer does not share the same philosophy and encodes whatever field-data was passed:

var b bytes.Buffer
w := csv.NewWriter(&b)
w.Write([]string{"ID", "Data"})
w.Write([]string{"1", "Let\r\nmy\ndata\rbe"})
w.Flush()
fmt.Printf("%+q", b.String())

yields:

"ID,Data\n1,\"Let\r\nmy\ndata\rbe\"\n"

The Writer faithfully represents the data; the Reader does not. [:gopher_dunno:]

Russ defends Go's position for removing CR here.

package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"encoding/xml"
	"fmt"
)

func main() {
	// myData contains CRLF which csv.Reader will transform to LF; Reader doesn't
	// mind lone CR at end of line. csv.Writer doesn't care at all.
	//
	// The JSON and XML encoders/decoders have no philosophy on line endings **in data**.
	const myData = "let\r\nmy\ndata\rbe"

	var b []byte // the encoded bytes
	var s string // the (re)decoded string

	b = encJSON(myData)
	fmt.Printf("encoded JSON %+q\n", string(b))
	s = decJSON(b)
	fmt.Printf("decoded JSON %+q\n", s)

	b = encXML(myData)
	fmt.Printf("encoded XML %+q\n", string(b))
	s = decXML(b)
	fmt.Printf("decoded XML %+q\n", s)

	b = encCSV(myData)
	fmt.Printf("encoded CSV %+q\n", string(b))
	s = decCSV(b)
	fmt.Printf("decoded CSV %+q\n", s)

	// Output:
	// encoded JSON "\"let\\r\\nmy\\ndata\\rbe\""
	// decoded JSON "let\r\nmy\ndata\rbe"
	// encoded XML "<root>let\r\nmy\ndata\rbe</root>"
	// decoded XML "let\r\nmy\ndata\rbe"
	// encoded CSV "\"let\r\nmy\ndata\rbe\"\n"
	// decoded CSV "let\nmy\ndata\rbe"
}

// encCSV writes s as a single-field CSV record. Errors are ignored for brevity.
func encCSV(s string) []byte {
	var b bytes.Buffer
	w := csv.NewWriter(&b)
	w.Write([]string{s})
	w.Flush()
	return b.Bytes()
}

// decCSV reads the first record and returns its first field.
func decCSV(b []byte) string {
	r := csv.NewReader(bytes.NewReader(b))
	record, _ := r.Read()
	return record[0]
}

func encJSON(s string) []byte {
	b, _ := json.Marshal(s)
	return b
}

func decJSON(b []byte) string {
	s := ""
	json.Unmarshal(b, &s)
	return s
}

// elem carries its value as raw inner XML, so the bytes are written and read verbatim.
type elem struct {
	XMLName xml.Name `xml:"root"`
	Value   string   `xml:",innerxml"`
}

func encXML(s string) []byte {
	elem := elem{Value: s}
	b, _ := xml.Marshal(elem)
	return b
}

func decXML(b []byte) string {
	var elem elem
	xml.Unmarshal(b, &elem)
	return elem.Value
}