Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: add the ability to get the line number of a record #44221

Closed
rogpeppe opened this issue Feb 11, 2021 · 25 comments
Closed

encoding/csv: add the ability to get the line number of a record #44221

rogpeppe opened this issue Feb 11, 2021 · 25 comments

Comments

@rogpeppe
Copy link
Contributor

Issue #26679 was closed as being a niche feature, but it's actually quite common to need this information.

For example, after a record has been read, the syntax of its fields might be checked, and could be invalid. In this case, it's useful to be able to produce a good error message that points, at least roughly, to where the problem is. When CSV files can be millions of lines long, this is very important.

The workaround proposed there is awkward to implement (it's easy to get an io.Reader implementation wrong) and doesn't work currently anyway. The only other current alternative is to wrap the underlying reader with a reader that counts lines and only reads a line at a time, which is again awkward to implement, and introduces a second unecessary layer of buffering to slow things down.

As csv.Reader already keeps track of line numbers, providing that information in the API seems natural.

We propose a new method:

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int

There's a straw-man implementation here.

Thanks to @mvdan for consultation on this issue.

@gopherbot gopherbot added this to the Proposal milestone Feb 11, 2021
@peterGo
Copy link
Contributor

peterGo commented Feb 24, 2021

Your proposal

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int

Line is already defined as

// A ParseError is returned for parsing errors.
// Line numbers are 1-indexed and columns are 0-indexed.
type ParseError struct {
    StartLine int   // Line where the record starts; added in Go 1.10
    Line      int   // Line where the error occurred
    Column    int   // Column (rune index) where the error occurred
    Err       error // The actual error
}

You seem to be using Line to refer to StartLine. That's confusing. Rename the method to to StartLine.

Your straw-person implementation reflects the confusion.

// Line returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
func (r *Reader) Line() int {
    return r.startLine
}

The method documentation should be precise. Add indexing information from the ParseError documentation.

Revised method

// StartLine returns the line number of the start of the most
// recently returned record. If no records have been returned,
// it returns zero.
// StartLine numbers are 1-indexed.
func (r *Reader) StartLine() int

@mvdan
Copy link
Member

mvdan commented Feb 24, 2021

I actually suggested renaming the StartLine method to Line, since otherwise it seems like the line number for the start of the entire CSV document instead of the current record.

@rogpeppe
Copy link
Contributor Author

As @mvdan said, my initial version of the proposal used StartLine rather than Line. I'm easy either way tbh. Any other votes?

@rsc
Copy link
Contributor

rsc commented Feb 24, 2021

Maybe it should be

// Pos returns the position of the most recently read record.
func (r *Reader) Pos() (startLine, endLine int)

It seems like if you want one you might want both?

@rsc
Copy link
Contributor

rsc commented Feb 24, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rogpeppe
Copy link
Contributor Author

rogpeppe commented Feb 24, 2021

I've never seen any requirement for finding the last line in particular.

What would be more useful however would be the ability to find the line and column for any given field on the current line. I'm not sure: is the CSV encoding reversible? If so, maybe it would be possible to determine that given the current line number and the current field values, which could provide the last line number as a natural consequence.

@rsc
Copy link
Contributor

rsc commented Mar 10, 2021

Even if you don't need the endLine normally, having Pos return both makes clear that they might be different, which could be useful.
It could even be called LastPos instead of Pos.
We could also save column info for the fields easily enough. It's a tiny amount of data and would be reused.
What API would you want to return the per-field info?

@rogpeppe
Copy link
Contributor Author

How about something like this?

// FieldPos returns the line and column corresponding to
// the field with the given index in the entry most recently
// returned by Read. Numbering of lines and columns starts at 1.
//
// If this is called with an out-of-bounds index, it panics.
func (r *Reader) FieldPos(index int) (line, column int)

Then the equivalent of my original Line method can be obtained
by calling FieldPos(0).

@rsc
Copy link
Contributor

rsc commented Mar 24, 2021

@rogpeppe, can you confirm that the API you suggested in the previous comment is implementable?
Assuming it is, it sounds good.

Does anyone object to the FieldPos API?

@rogpeppe
Copy link
Contributor Author

I'm pretty sure it's implementable. My main concern would be whether it can be implemented while adding negligible runtime overhead.

@rsc
Copy link
Contributor

rsc commented Mar 31, 2021

It occurred to me at some point over the past week that the Reader just needs a []int that it reuses across each decoding. That can't possibly be much overhead.

@rogpeppe
Copy link
Contributor Author

It occurred to me at some point over the past week that the Reader just needs a []int that it reuses across each decoding

Doesn't it need to be a slice of (line, column) pairs? Still not much overhead, but it might be non-negligible for some.
I guess if the line numbers are relative to the line number of the start of the line, then we don't need to worry about the line part overflowing a 32 bit int.

@rsc
Copy link
Contributor

rsc commented Apr 1, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rogpeppe
Copy link
Contributor Author

rogpeppe commented Apr 1, 2021

One question here: what is the correct unit to use for the column: bytes or runes? Currently the error counts columns in runes, but as a counter-example, the Go compiler's positions count columns in bytes. Counting in bytes is cheaper and easier, and I'd be inclined to do that except for the currently documented and exported semantics of the Column field in the ParseError type.

Would it be too bad if FieldPos returned the column as a byte offset, inconsistently from the column in ParseError ?

I was writing some code that calculates rune offsets lazily when FieldPos is called to avoid the overhead of calculating them on the fly, but it's a pain to do properly, especially in the presence of TrimLeadingSpace which can trim unicode characters that don't end up in the resulting field data.

@rsc
Copy link
Contributor

rsc commented Apr 1, 2021

Runes is pointless; please do bytes. :-)

@rogpeppe
Copy link
Contributor Author

rogpeppe commented Apr 1, 2021

What do you think about changing the existing behaviour of the column in the ParseError ?

@rsc
Copy link
Contributor

rsc commented Apr 1, 2021

I did a quick scan of my Go corpus from spring 2020. I found zero uses of the Column field, so I think it is safe to change.
I found some tests but they were only checking ASCII inputs.
I found some other implementations setting Column in their own csv.ParseErrors:

  • biogo thinks Column is the integer field number, not the rune offset.
  • dolt sets Column correctly to a rune index, but again there are zero uses of the field.

If we made the change I think it would be easy to approach the biogo and dolt folks to get their implementations fixed.

Here's the actual instances:

Correctly computing rune-based Column (would be affected)

github.com/dolthub/dolt/go@v0.0.0-20200327004621-c4414210ff5e/libraries/doltcore/table/untyped/csv/reader.go:385:				err = &csv.ParseError{StartLine: recordStartLine, Line: csvr.numLine, Column: col, Err: csv.ErrQuote}
github.com/dolthub/dolt/go@v0.0.0-20200327004621-c4414210ff5e/libraries/doltcore/table/untyped/csv/reader.go:403:				err = &csv.ParseError{StartLine: recordStartLine, Line: csvr.numLine, Column: col, Err: csv.ErrQuote}

Has tests checking Column but they're ASCII only, and the implementation is wrapping encoding/csv (would be unaffected):

github.com/mzimmerman/multicorecsv@v0.0.0-20160707020140-e722d0bc1822/multicorecsv_test.go:306:			} else if tt.Line != 0 && (tt.Line != perr.Line || tt.Column != perr.Column) {

Hard-coded Column: 1 (unaffected):

github.com/jszwec/csvutil@v1.2.3/csvutil_go110_test.go:10:var testUnmarshalInvalidFirstLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go110_test.go:17:var testUnmarshalInvalidSecondLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go17_test.go:7:var testUnmarshalInvalidFirstLineErr = &csv.ParseError{
github.com/jszwec/csvutil@v1.2.3/csvutil_go17_test.go:13:var testUnmarshalInvalidSecondLineErr = &csv.ParseError{

Thinks Column is CSV field index (unaffected):

github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:100:		panic(&csv.ParseError{Column: column, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:108:		panic(&csv.ParseError{Column: column, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:126:		panic(&csv.ParseError{Column: index, Err: ErrBadStrandField})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:130:		panic(&csv.ParseError{Column: index, Err: ErrBadStrand})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:142:		panic(&csv.ParseError{Column: index, Err: ErrBadColorField})
github.com/biogo/biogo@v1.0.2/io/featio/bed/bed.go:402:		if err, ok := err.(*csv.ParseError); ok {
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:241:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:252:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:263:		panic(&csv.ParseError{Line: line, Column: index, Err: err})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:281:		panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadStrandField})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:285:		panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadStrand})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:312:				panic(&csv.ParseError{Line: line, Column: column, Err: ErrBadTag})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:338:			panic(&csv.ParseError{Line: line, Column: index, Err: ErrBadTag})
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:376:		return nil, &csv.ParseError{Line: r.line, Err: ErrEmptyMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:382:			return nil, &csv.ParseError{Line: r.line, Err: ErrNotHandled}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:388:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:394:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:405:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:414:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:423:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadMetaLine}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:427:		return nil, &csv.ParseError{Line: r.line, Err: ErrNotHandled}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:441:			return nil, &csv.ParseError{Line: r.line, Err: err}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:449:			return nil, &csv.ParseError{Line: r.line, Err: ErrBadSequence}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:488:			return nil, &csv.ParseError{Line: r.line, Err: err}
github.com/biogo/biogo@v1.0.2/io/featio/gff/gff.go:504:		return nil, &csv.ParseError{Line: r.line, Column: len(fields), Err: ErrFieldMissing}
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error.go:11:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error.go:12:	return &csv.ParseError{
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error_go1.10.go:11:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/biogo@v1.0.2/io/seqio/fai/error_go1.10.go:12:	return &csv.ParseError{
github.com/biogo/biogo@v1.0.2/io/seqio/fai/fai.go:71:// contains non-unique records the error is a csv.ParseError identifying the second non-unique
github.com/biogo/biogo@v1.0.2/io/seqio/fai/fai.go:84:			if _, ok = r.(*csv.ParseError); !ok {
github.com/biogo/hts@v1.1.0/fai/fai.go:127:func parseError(line, column int, err error) *csv.ParseError {
github.com/biogo/hts@v1.1.0/fai/fai.go:128:	return &csv.ParseError{

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/291290 mentions this issue: encoding/csv: add FieldPos method

@rogpeppe
Copy link
Contributor Author

rogpeppe commented Apr 2, 2021

OK, I've updated https://go-review.googlesource.com/c/go/+/291290 to implement the above discussed semantics, and it also changes the semantics of ParseError.Column to be the same as go/token.Position (i.e. byte-base, one-based index).

Unfortunately the performance difference is noticeable (I think that's probably because we're appending 128 bits, not 64 bits to the fieldInfo slice, but I'm not sure).

@rsc
Copy link
Contributor

rsc commented Apr 7, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: encoding/csv: add the ability to get the line number of a record encoding/csv: add the ability to get the line number of a record Apr 7, 2021
@rsc rsc modified the milestones: Proposal, Backlog Apr 7, 2021
rogpeppe added a commit to influxdata/line-protocol that referenced this issue Apr 12, 2021
[This comment](golang/go#44221 (comment)) related
to columns in the encoding/csv package made me realise that we should do
byte offset, not rune offset.
@rogpeppe
Copy link
Contributor Author

I am struggling to avoid a noticeable performance overhead from making this change.
How much performance penalty are we willing to pay for per-field position info?

@mvdan
Copy link
Member

mvdan commented Apr 28, 2021

Roger and I briefly discussed this offline. Rebased on the latest master, I could only reproduce a smaller perf loss, mainly for large fields:

name                                     old time/op    new time/op    delta
Read-8                                     2.74µs ± 1%    2.74µs ± 0%    ~     (p=0.368 n=6+6)
ReadWithFieldsPerRecord-8                  2.73µs ± 0%    2.74µs ± 0%  +0.29%  (p=0.013 n=6+6)
ReadWithoutFieldsPerRecord-8               2.74µs ± 0%    2.75µs ± 1%    ~     (p=0.091 n=5+6)
ReadLargeFields-8                          4.52µs ± 0%    4.70µs ± 1%  +4.03%  (p=0.004 n=6+5)
ReadReuseRecord-8                          1.53µs ± 1%    1.53µs ± 1%    ~     (p=0.727 n=6+6)
ReadReuseRecordWithFieldsPerRecord-8       1.54µs ± 0%    1.53µs ± 0%  -0.65%  (p=0.008 n=5+5)
ReadReuseRecordWithoutFieldsPerRecord-8    1.53µs ± 1%    1.56µs ± 1%  +2.31%  (p=0.002 n=6+6)
ReadReuseRecordLargeFields-8               3.26µs ± 0%    3.43µs ± 0%  +5.24%  (p=0.004 n=6+5)
Write-8                                    1.35µs ± 1%    1.35µs ± 1%    ~     (p=0.563 n=6+5)

The single commit includes a refactor moving some code to parseField, so it's hard to tell where the perf loss is, or if it's to blame on the extra func refactor itself. I think it would be a good idea to split the work into two commits, where the first commit does the extra func refactor with no behavioral changes, to get a better idea.

That aside, a 2-5% perf hit on some edge cases seems reasonable to me. Though I'm still curious why large fields in particular get slower, since that intuitively doesn't make sense - keeping track of positions shouldn't be slower for large fields.

@rsc
Copy link
Contributor

rsc commented Apr 28, 2021

This slowdown seems fine. Perhaps we will find more optimizations later.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/314774 mentions this issue: encoding/csv: dd FieldPos method

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/323349 mentions this issue: doc/go1.17: mention new encoding/csv/Reader.FieldPos method

gopherbot pushed a commit that referenced this issue May 28, 2021
For #44221
For #44513

Change-Id: I2d2d1c55255f4411c11fd51f0f3ae726cbf4d136
Reviewed-on: https://go-review.googlesource.com/c/go/+/323349
Trust: Ian Lance Taylor <iant@golang.org>
Reviewed-by: roger peppe <rogpeppe@gmail.com>
@rsc rsc mentioned this issue Jun 10, 2021
83 tasks
@golang golang locked and limited conversation to collaborators May 27, 2022
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc rsc removed this from Proposals Oct 19, 2022
@dmitshur dmitshur modified the milestones: Backlog, Go1.17 Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants