read_fwf docs #49832

Closed
wants to merge 9 commits into from

Conversation

@RonaldBarnes RonaldBarnes commented Nov 22, 2022

Enhanced documentation on read_fwf: clarifies whitespace is stripped by default and how to override via setting delimiter.

pandas/io/parsers/readers.py had no mention of delimiter option.

if it is not spaces (e.g., '~').
Default are space and tab characters.
Used to specify the character(s) to strip from start and end of every field.
To preserve whitespace, set to a character that does not exist in the data,
Member

This sounds like an anti pattern, you should not use the function at all when you want to read the whole data as one column

Author

@RonaldBarnes RonaldBarnes Nov 23, 2022

Hi @phofl,

I think there's a misunderstanding.

This PR is merely documenting the current anti-patterns that exist in read_fwf:

  1. Input file contains 172 fields / columns, precisely defined by colspecs.
  2. File is read into DataFrame
  3. Data is mangled - white space is stripped
  4. To preserve white space, a delimiter field is also required, and its value must be something that will never appear at start or end of any field
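The four steps above can be reproduced with a small sketch. The two-field layout and the sample rows are made up for illustration, and the behaviour shown assumes a pandas version like the one discussed in this thread, where `delimiter` doubles as the set of characters stripped from each field.

```python
import io

import pandas as pd

# Hypothetical records: two fields, each exactly 4 characters wide.
data = "  ab  cd\n ef gh  \n"
colspecs = [(0, 4), (4, 8)]

# Default: spaces and tabs are stripped from both ends of every field.
stripped = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None)

# Workaround: pass a delimiter that never appears in the data ("~" here),
# so the padding inside each field is left alone.
preserved = pd.read_fwf(
    io.StringIO(data), colspecs=colspecs, header=None, delimiter="~"
)

print(stripped.values.tolist())
print(preserved.values.tolist())
```

Comparing the two printed lists side by side shows which padding was lost in the default call.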

Anti-patterns observed:

  • In a fixed-width data file, data should not be changed unless explicitly requested
  • Fixed-width files do not have delimiters, rather colspecs
  • read_csv will preserve white space

I think parts of read_fwf were designed to handle tabular, human readable data, not flat database files. For reading tabular data, read_table seems the appropriate tool, IMHO.

TL;DR This PR is attempting to accurately describe the current behaviour as #16772 shows people still confused by it and #16950 didn't address readers.py.

Also, thank you @jbrockmendel for labelling this as Docs! I could not figure out how to do that myself.

Member

I'd rather fix this instead of documenting a workaround then

Author

Happy to hear a fix is preferred. Working on that now.

Expecting controversy by breaking current default behaviour but will clearly document how to achieve current behaviour of stripping white space. Am inclined to also mention read_table as a potential solution for some users.

If anyone is using delimiter="~" as is mentioned as an example in the documentation, planning to continue to support such usage, but thinking to raise FutureWarning if delimiter keyword is used.

Is this reasonable / acceptable pandas policy?

Should I amend doc/source/whatsnew/v1.5.3.rst or doc/source/whatsnew/v2.0.0.rst?

Thank you for your help with all the issues with attempting a successful first PR!
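For illustration only, the "keep supporting `delimiter` but raise `FutureWarning`" idea could look roughly like the sketch below. The function `read_fields` and the warning text are hypothetical, not pandas code, and the stripping is a deliberately simplistic stand-in for the real parser.

```python
import warnings


def read_fields(line, colspecs, delimiter=None):
    """Hypothetical reader: the old 'delimiter' keyword still works but warns."""
    if delimiter is not None:
        warnings.warn(
            "'delimiter' is deprecated for fixed-width parsing",
            FutureWarning,
            stacklevel=2,
        )
    # Assumed behaviour: strip line endings plus the delimiter characters,
    # or line endings plus space and tab when no delimiter is given.
    strip_chars = "\r\n" + delimiter if delimiter else "\n\r\t "
    return [line[a:b].strip(strip_chars) for a, b in colspecs]


# Capture warnings so the effect of passing 'delimiter' is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fields = read_fields("~ab~ cd \n", [(0, 4), (4, 9)], delimiter="~")

print(fields)
print([w.category.__name__ for w in caught])
```

Existing callers keep getting the same fields back; only the warning is new.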

Member

2.0, only regression fixes are backported

So to summarise: if a delimiter is passed and the character is present in the data, what happens? And if a delimiter is passed that does not exist in the data, all whitespace is preserved, correct?

Author

Correct on both counts.

  • If a delimiter is passed and the character is present, it is stripped from start & end of every field.
  • If a delimiter is passed (that is not a space char), then whitespaces are preserved.

Assigning default value(s) to delimiter:
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1168

Stripping delimiter(s) from each field (thus also removes \n\r from each line):
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1267
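
A simplified, pure-Python sketch of the two linked steps, assuming they behave as described in this comment. The name `split_fixed_width` is made up for illustration; this is not the pandas implementation.

```python
def split_fixed_width(line, colspecs, delimiter=None):
    """Slice one line by colspecs, then strip the 'delimiter' characters."""
    # Line endings are always stripped; space and tab only by default.
    strip_chars = "\r\n" + delimiter if delimiter else "\n\r\t "
    return [line[start:stop].strip(strip_chars) for start, stop in colspecs]


line = "~~ab  cd \n"
colspecs = [(0, 4), (4, 10)]

# Default: whitespace is stripped, "~" is kept.
print(split_fixed_width(line, colspecs))

# delimiter="~": "~" is stripped from the field ends, spaces are kept.
print(split_fixed_width(line, colspecs, delimiter="~"))
```

Because `str.strip` only removes characters from the two ends of a field, a delimiter character in the middle of a field is left untouched in this sketch.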

Member

Is the fact that we strip the character from the end of the fields documented anywhere? If no, we can definitely deprecate. This sounds odd

Author

It is mentioned obliquely:

https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/io.rst

The function parameters to read_fwf are largely the same as read_csv with two extra parameters, and a different usage of the delimiter parameter:
...
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').

This is confusing, as for flat files any use of delimiters is unexpected since colspecs are defined (or, inferred - need to check the use of delimiters here).

In a flat file, there are no "filler character[s]", hence confusion.

Later, among the examples, is this:

The parser will take care of extra white spaces around the columns so it's ok to have extra separation between the columns in the file.

All examples are using tabular (human-readable) data.

Some conflation between read_table and read_fwf, IMHO.

See mentions at #16772, from 2017, and follow up questions still in 2022.

Author

I'd rather fix this instead of documenting a workaround then

I think I've come up with a solution that

  • causes minimal disruption to users depending on existing behaviour
  • clearly documents existing behaviour
  • adds 2 options to give finer-grained control over the whitespace handling in read_fwf

A newer PR can be found at: #51569

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 29, 2022
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. Additionally, if a fix is required, it should probably be discussed in an issue before moving forward, so closing in favor of a future issue.

@mroeschke mroeschke closed this Jan 4, 2023
RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this pull request Jan 26, 2023
…ce' (default=True) and 'whitespace_chars' (default=[space] and [tab] chars). Deprecation warning for 'delimiter'.

See pandas-dev#49832 (comment)

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this pull request Jan 27, 2023
	* 'keep_whitespace' (default=True)
	* 'whitespace_chars' (default=[space] and [tab] chars)

See:
	pandas-dev#49832 (comment)
	https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space
	https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space

* changes in pandas/io/parsers/readers.py:
		_fwf_defaults()
		read_fwf()
* pandas/io/parsers/python_parser.py
		FixedWidthReader
			__init__
			__next__
		FixedWidthFieldParser
			__init__
			_make_reader

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
@RonaldBarnes RonaldBarnes mentioned this pull request Jan 27, 2023
5 tasks
4 participants