encoding/csv: add the ability to get the line number of a record #44221

rogpeppe · 2021-02-11T14:17:15Z

Issue #26679 was closed as being a niche feature, but it's actually quite common to need this information.

For example, after a record has been read, the syntax of its fields might be checked, and could be invalid. In this case, it's useful to be able to produce a good error message that points, at least roughly, to where the problem is. When CSV files can be millions of lines long, this is very important.

The workaround proposed there is awkward to implement (it's easy to get an io.Reader implementation wrong) and doesn't work currently anyway. The only other current alternative is to wrap the underlying reader with a reader that counts lines and only reads a line at a time, which is again awkward to implement, and introduces a second unecessary layer of buffering to slow things down.

As csv.Reader already keeps track of line numbers, providing that information in the API seems natural.

We propose a new method:

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int

There's a straw-man implementation here.

Thanks to @mvdan for consultation on this issue.

The text was updated successfully, but these errors were encountered:

peterGo · 2021-02-24T15:12:54Z

Your proposal

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int

Line is already defined as

// A ParseError is returned for parsing errors.
// Line numbers are 1-indexed and columns are 0-indexed.
type ParseError struct {
    StartLine int   // Line where the record starts; added in Go 1.10
    Line      int   // Line where the error occurred
    Column    int   // Column (rune index) where the error occurred
    Err       error // The actual error
}

You seem to be using Line to refer to StartLine. That's confusing. Rename the method to to StartLine.

Your straw-person implementation reflects the confusion.

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int {
    return r.startLine
}

The method documentation should be precise. Add indexing information from the ParseError documentation.

Revised method

// StartLine returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
// StartLine numbers are 1-indexed.
func (r *Reader) StartLine() int

mvdan · 2021-02-24T15:30:34Z

I actually suggested renaming the StartLine method to Line, since otherwise it seems like the line number for the start of the entire CSV document instead of the current record.

rogpeppe · 2021-02-24T16:11:54Z

As @mvdan said, my initial version of the proposal used StartLine rather than Line. I'm easy either way tbh. Any other votes?

rsc · 2021-02-24T18:40:14Z

Maybe it should be

// Pos returns the position of the most recently read record.
func (r *Reader) Pos() (startLine, endLine int)

It seems like if you want one you might want both?

rsc · 2021-02-24T19:46:16Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

rogpeppe · 2021-02-24T21:04:19Z

I've never seen any requirement for finding the last line in particular.

What would be more useful however would be the ability to find the line and column for any given field on the current line. I'm not sure: is the CSV encoding reversible? If so, maybe it would be possible to determine that given the current line number and the current field values, which could provide the last line number as a natural consequence.

rsc · 2021-03-10T18:53:10Z

Even if you don't need the endLine normally, having Pos return both makes clear that they might be different, which could be useful.
It could even be called LastPos instead of Pos.
We could also save column info for the fields easily enough. It's a tiny amount of data and would be reused.
What API would you want to return the per-field info?

rogpeppe · 2021-03-11T07:43:50Z

How about something like this?

// FieldPos returns the line and column corresponding to
// the field with the given index in the entry most recently
// returned by Read. Numbering of lines and columns starts at 1.
//
// If this is called with an out-of-bounds index, it panics.
func (r *Reader) FieldPos(index int) (line, column int)

Then the equivalent of my original Line method can be obtained
by calling FieldPos(0).

rsc · 2021-03-24T17:38:34Z

@rogpeppe, can you confirm that the API you suggested in the previous comment is implementable?
Assuming it is, it sounds good.

Does anyone object to the FieldPos API?

rogpeppe · 2021-03-24T18:16:45Z

I'm pretty sure it's implementable. My main concern would be whether it can be implemented while adding negligible runtime overhead.

rsc · 2021-03-31T17:26:28Z

It occurred to me at some point over the past week that the Reader just needs a []int that it reuses across each decoding. That can't possibly be much overhead.

rogpeppe · 2021-03-31T22:10:09Z

It occurred to me at some point over the past week that the Reader just needs a []int that it reuses across each decoding

Doesn't it need to be a slice of (line, column) pairs? Still not much overhead, but it might be non-negligible for some.
I guess if the line numbers are relative to the line number of the start of the line, then we don't need to worry about the line part overflowing a 32 bit int.

rsc · 2021-04-01T03:16:26Z

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

rogpeppe · 2021-04-01T15:30:36Z

One question here: what is the correct unit to use for the column: bytes or runes? Currently the error counts columns in runes, but as a counter-example, the Go compiler's positions count columns in bytes. Counting in bytes is cheaper and easier, and I'd be inclined to do that except for the currently documented and exported semantics of the Column field in the ParseError type.

Would it be too bad if FieldPos returned the column as a byte offset, inconsistently from the column in ParseError ?

I was writing some code that calculates rune offsets lazily when FieldPos is called to avoid the overhead of calculating them on the fly, but it's a pain to do properly, especially in the presence of TrimLeadingSpace which can trim unicode characters that don't end up in the resulting field data.

rsc · 2021-04-01T15:38:44Z

Runes is pointless; please do bytes. :-)

rogpeppe · 2021-04-01T17:39:10Z

What do you think about changing the existing behaviour of the column in the ParseError ?

rsc · 2021-04-01T18:58:06Z

I did a quick scan of my Go corpus from spring 2020. I found zero uses of the Column field, so I think it is safe to change.
I found some tests but they were only checking ASCII inputs.
I found some other implementations setting Column in their own csv.ParseErrors:

biogo thinks Column is the integer field number, not the rune offset.
dolt sets Column correctly to a rune index, but again there are zero uses of the field.

If we made the change I think it would be easy to approach the biogo and dolt folks to get their implementations fixed.

Here's the actual instances:

Correctly computing rune-based Column (would be affected)

github.com/dolthub/dolt/go@v0.0.0-20200327004621-c4414210ff5e/libraries/doltcore/table/untyped/csv/reader.go:385:				err = &csv.ParseError{StartLine: recordStartLine, Line: csvr.numLine, Column: col, Err: csv.ErrQuote}
github.com/dolthub/dolt/go@v0.0.0-20200327004621-c4414210ff5e/libraries/doltcore/table/untyped/csv/reader.go:403:				err = &csv.ParseError{StartLine: recordStartLine, Line: csvr.numLine, Column: col, Err: csv.ErrQuote}

Has tests checking Column but they're ASCII only, and the implementation is wrapping encoding/csv (would be unaffected):

github.com/mzimmerman/multicorecsv@v0.0.0-20160707020140-e722d0bc1822/multicorecsv_test.go:306:			} else if tt.Line != 0 && (tt.Line != perr.Line || tt.Column != perr.Column) {

Hard-coded Column: 1 (unaffected):

github.com/jszwec/csvutil@v1.2.3/csvutil_go110_test.go:10:var testUnmarshalInvalidFirstLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go110_test.go:17:var testUnmarshalInvalidSecondLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go17_test.go:7:var testUnmarshalInvalidFirstLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go17_test.go:13:var testUnmarshalInvalidSecondLineErr = &csv.ParseError{

Thinks Column is CSV field index (unaffected):

github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:100:		panic(&csv.ParseError{Column: column, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:108:		panic(&csv.ParseError{Column: column, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:126:		panic(&csv.ParseError{Column: index, Err: ErrBadStrandField})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:130:		panic(&csv.ParseError{Column: index, Err: ErrBadStrand})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:142:		panic(&csv.ParseError{Column: index, Err: ErrBadColorField})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:402:		if err, ok := err.(*csv.ParseError); ok {
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:241:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:252:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:263:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:281:		panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadStrandField})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:285:		panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadStrand})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:312:				panic(&csv.ParseError{Line: line, Column: column, Err: ErrBadTag})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:338:			panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadTag})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:376:		return nil, &csv.ParseError{Line: r.line, Err: ErrEmptyMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:382:			return nil, &csv.ParseError{Line: r.line, Err: ErrNotHandled}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:388:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:394:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:405:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:414:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:423:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:427:		return nil, &csv.ParseError{Line: r.line, Err: ErrNotHandled}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:441:			return nil, &csv.ParseError{Line: r.line, Err: err}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:449:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadSequence}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:488:			return nil, &csv.ParseError{Line: r.line, Err: err}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:504:		return nil, &csv.ParseError{Line: r.line, Column: len(fields), Err: ErrFieldMissing}
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error.go:11:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error.go:12:	return &csv.ParseError{
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error_go1.10.go:11:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error_go1.10.go:12:	return &csv.ParseError{
github.com/biogo/biogo@v1.0.2/io/seqio/fai/fai.go:71:// contains non-unique records the error is a csv.ParseError identifying the second non-unique
github.com/biogo/biogo@v1.0.2/io/seqio/fai/fai.go:84:			if _, ok = r.(*csv.ParseError); !ok {
github.com/biogo/hts@v1.1.0/fai/fai.go:127:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/hts@v1.1.0/fai/fai.go:128:	return &csv.ParseError{

gopherbot · 2021-04-02T11:54:11Z

Change https://golang.org/cl/291290 mentions this issue: encoding/csv: add FieldPos method

rogpeppe · 2021-04-02T13:20:59Z

OK, I've updated https://go-review.googlesource.com/c/go/+/291290 to implement the above discussed semantics, and it also changes the semantics of ParseError.Column to be the same as go/token.Position (i.e. byte-base, one-based index).

Unfortunately the performance difference is noticeable (I think that's probably because we're appending 128 bits, not 64 bits to the fieldInfo slice, but I'm not sure).

rsc · 2021-04-07T18:02:07Z

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

[This comment](golang/go#44221 (comment)) related to columns in the encoding/csv package made me realise that we should do byte offset, not rune offset.

rogpeppe · 2021-04-12T09:43:59Z

I am struggling to avoid a noticeable performance overhead from making this change.
How much performance penalty are we willing to pay for per-field position info?

mvdan · 2021-04-28T11:10:20Z

Roger and I briefly discussed this offline. Rebased on the latest master, I could only reproduce a smaller perf loss, mainly for large fields:

name                                     old time/op    new time/op    delta
Read-8                                     2.74µs ± 1%    2.74µs ± 0%    ~     (p=0.368 n=6+6)
ReadWithFieldsPerRecord-8                  2.73µs ± 0%    2.74µs ± 0%  +0.29%  (p=0.013 n=6+6)
ReadWithoutFieldsPerRecord-8               2.74µs ± 0%    2.75µs ± 1%    ~     (p=0.091 n=5+6)
ReadLargeFields-8                          4.52µs ± 0%    4.70µs ± 1%  +4.03%  (p=0.004 n=6+5)
ReadReuseRecord-8                          1.53µs ± 1%    1.53µs ± 1%    ~     (p=0.727 n=6+6)
ReadReuseRecordWithFieldsPerRecord-8       1.54µs ± 0%    1.53µs ± 0%  -0.65%  (p=0.008 n=5+5)
ReadReuseRecordWithoutFieldsPerRecord-8    1.53µs ± 1%    1.56µs ± 1%  +2.31%  (p=0.002 n=6+6)
ReadReuseRecordLargeFields-8               3.26µs ± 0%    3.43µs ± 0%  +5.24%  (p=0.004 n=6+5)
Write-8                                    1.35µs ± 1%    1.35µs ± 1%    ~     (p=0.563 n=6+5)

The single commit includes a refactor moving some code to parseField, so it's hard to tell where the perf loss is, or if it's to blame on the extra func refactor itself. I think it would be a good idea to split the work into two commits, where the first commit does the extra func refactor with no behavioral changes, to get a better idea.

That aside, a 2-5% perf hit on some edge cases seems reasonable to me. Though I'm still curious why large fields in particular get slower, since that intuitively doesn't make sense - keeping track of positions shouldn't be slower for large fields.

rsc · 2021-04-28T17:39:11Z

This slowdown seems fine. Perhaps we will find more optimizations later.

gopherbot · 2021-04-29T00:28:52Z

Change https://golang.org/cl/314774 mentions this issue: encoding/csv: dd FieldPos method

gopherbot · 2021-05-27T23:26:19Z

Change https://golang.org/cl/323349 mentions this issue: doc/go1.17: mention new encoding/csv/Reader.FieldPos method

For #44221 For #44513 Change-Id: I2d2d1c55255f4411c11fd51f0f3ae726cbf4d136 Reviewed-on: https://go-review.googlesource.com/c/go/+/323349 Trust: Ian Lance Taylor <iant@golang.org> Reviewed-by: roger peppe <rogpeppe@gmail.com>

gopherbot added this to the Proposal milestone Feb 11, 2021

gopherbot added the Proposal label Feb 11, 2021

rsc mentioned this issue Feb 24, 2021

proposal: review meeting minutes #33502

Open

rsc added the Proposal-FinalCommentPeriod label Apr 1, 2021

rsc changed the title ~~proposal: encoding/csv: add the ability to get the line number of a record~~ encoding/csv: add the ability to get the line number of a record Apr 7, 2021

rsc added the Proposal-Accepted label Apr 7, 2021

rsc modified the milestones: Proposal, Backlog Apr 7, 2021

rogpeppe added a commit to influxdata/line-protocol that referenced this issue Apr 12, 2021

influxdata: use bytes for column offset

2d51648

[This comment](golang/go#44221 (comment)) related to columns in the encoding/csv package made me realise that we should do byte offset, not rune offset.

rogpeppe mentioned this issue Apr 12, 2021

influxdata: use bytes for column offset influxdata/line-protocol#38

Merged

rsc mentioned this issue Apr 28, 2021

encoding/xml: add Decoder.InputPos #45628

Closed

gopherbot closed this as completed in 6d95e5a Apr 29, 2021

rsc mentioned this issue Jun 10, 2021

api: audit for Go 1.17 #46688

Closed

83 tasks

golang locked and limited conversation to collaborators May 27, 2022

gopherbot added the FrozenDueToAge label May 27, 2022

rsc moved this to Accepted in Proposals Aug 10, 2022

rsc added this to Proposals Aug 10, 2022

rsc removed this from Proposals Oct 19, 2022

dmitshur modified the milestones: Backlog, Go1.17 Aug 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encoding/csv: add the ability to get the line number of a record #44221

encoding/csv: add the ability to get the line number of a record #44221

rogpeppe commented Feb 11, 2021

peterGo commented Feb 24, 2021

mvdan commented Feb 24, 2021

rogpeppe commented Feb 24, 2021

rsc commented Feb 24, 2021

rsc commented Feb 24, 2021

rogpeppe commented Feb 24, 2021 •

edited

Loading

rsc commented Mar 10, 2021

rogpeppe commented Mar 11, 2021

rsc commented Mar 24, 2021 •

edited

Loading

rogpeppe commented Mar 24, 2021

rsc commented Mar 31, 2021

rogpeppe commented Mar 31, 2021

rsc commented Apr 1, 2021

rogpeppe commented Apr 1, 2021

rsc commented Apr 1, 2021

rogpeppe commented Apr 1, 2021

rsc commented Apr 1, 2021

gopherbot commented Apr 2, 2021

rogpeppe commented Apr 2, 2021

rsc commented Apr 7, 2021

rogpeppe commented Apr 12, 2021

mvdan commented Apr 28, 2021

rsc commented Apr 28, 2021

gopherbot commented Apr 29, 2021

gopherbot commented May 27, 2021

encoding/csv: add the ability to get the line number of a record #44221

encoding/csv: add the ability to get the line number of a record #44221

Comments

rogpeppe commented Feb 11, 2021

peterGo commented Feb 24, 2021

mvdan commented Feb 24, 2021

rogpeppe commented Feb 24, 2021

rsc commented Feb 24, 2021

rsc commented Feb 24, 2021

rogpeppe commented Feb 24, 2021 • edited Loading

rsc commented Mar 10, 2021

rogpeppe commented Mar 11, 2021

rsc commented Mar 24, 2021 • edited Loading

rogpeppe commented Mar 24, 2021

rsc commented Mar 31, 2021

rogpeppe commented Mar 31, 2021

rsc commented Apr 1, 2021

rogpeppe commented Apr 1, 2021

rsc commented Apr 1, 2021

rogpeppe commented Apr 1, 2021

rsc commented Apr 1, 2021

gopherbot commented Apr 2, 2021

rogpeppe commented Apr 2, 2021

rsc commented Apr 7, 2021

rogpeppe commented Apr 12, 2021

mvdan commented Apr 28, 2021

rsc commented Apr 28, 2021

gopherbot commented Apr 29, 2021

gopherbot commented May 27, 2021

rogpeppe commented Feb 24, 2021 •

edited

Loading

rsc commented Mar 24, 2021 •

edited

Loading