Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

Open
3 tasks done
RonaldBarnes opened this issue Feb 22, 2023 · 6 comments
Open
3 tasks done

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

RonaldBarnes opened this issue Feb 22, 2023 · 6 comments
Assignees
Labels

Comments

@RonaldBarnes
Copy link

RonaldBarnes commented Feb 22, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

data = """
CompanyName Alice        7500.00 5  1 3 5 D F 
OrganizationBob          8790.00 6 1 4 A 9 BC 
"""

df = pd.read_fwf(StringIO(data),
  widths=[12,12,8,2,12],
  header=None,
  names=["Company", "Contact", "Pay_sum", "Pay_count", "Credit_score"],
  )
print(df.values)

[['CompanyName' 'Alice' 7500.0 5 '1 3 5 D F']
 ['Organization' 'Bob' 8790.0 6 '1 4 A 9 BC']]

Issue Description

Note in code sample that the Credit_score has had leading (and trailing) spaces removed.
This is now irretrievably corrupted data.

Expected Behavior

Expected: The leading (at minimum) spaces are preserved, as they have significance.

The field width should match the widths or colspecs specified unless explicitly requesting otherwise.

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-32-generic
Version : #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : 0.29.33
pytest : 7.2.1
hypothesis : 6.65.1
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.8.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@RonaldBarnes RonaldBarnes added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 22, 2023
@RonaldBarnes
Copy link
Author

Consider the following code:

import pandas as pd
from io import StringIO


data = ["left      ","  centre  ","     right"]

print(f"data: {data}")


## Preserves whitespace:
df = pd.DataFrame([data], columns=["One", "Two", "Three"])


## Preserves whitespace:
df_csv = pd.read_csv(
	StringIO(",".join(data)),
	header=None,
	names=["One", "Two", "Three"],
	)


## Corrupts whitespace:
df_fwf = pd.read_fwf(
	StringIO("".join(data)),
	header=None,
	widths=[10,10,10],
	names=["One", "Two", "Three"],
	)

print("DataFrame values:")
print(df.values)
print("read_csv values:")
print(df_csv.values)
print("read_fwf values:")
print(df_fwf.values)


data: ['left      ', '  centre  ', '     right']
DataFrame values:
[['left      ' '  centre  ' '     right']]
read_csv values:
[['left      ' '  centre  ' '     right']]
read_fwf values:
[['left' 'centre' 'right']]

The one dataframe creation method where I explicitly specify the column widths is the only method that changes my data when reading it.

@RonaldBarnes
Copy link
Author

Currently, there is a bit of a work-around for this issue, but it's quite an "anti-pattern", and very counter-intuitive.

Adding a delimiter argument to the read_fwf call produces different output, despite there being no delimiters in the fixed-width file (pretty much by definition):

Using the code snippet from previous comment, but adding a delimiter (must be something that does not exist at the boundaries of fields!):

df_fwf_delim = pd.read_fwf(
	StringIO("".join(data)),
	header=None,
	widths=[10,10,10],
	names=["One", "Two", "Three"],
	delimiter="?",
	)

print("read_fwf DELIMITER values:")
print(df_fwf_delim.values))

read_fwf DELIMITER values:
[['left      ' '  centre  ' '     right']]

The source of this behaviour is found at:

self.delimiter = "\r\n" + delimiter if delimiter else "\n\r\t "

return [line[from_:to].strip(self.delimiter) for (from_, to) in self.colspecs]

It seems that read_fwf has been made to work with tables to the detriment of actual fixed-width files (a format predating YAML, JSON, CSV, etc.).

A read_table already exists, so I question whether this behaviour belongs in this function.

@RonaldBarnes
Copy link
Author

Final note (for now):

There is no mention of delimiters for read_fwf in the API reference at
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html

There is a confusing mention of them in the user_guide at
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-fwf-reader

delimiter: Characters to consider as filler characters in the fixed-width file.
Can be used to specify the filler character of the fields if it is not
spaces (e.g., ‘~’).

It can be argued that trailing spaces are "filler characters", but not leading
spaces.

Like in a Python script file, they have significance and should not be removed
unless explicitely requested.

@RonaldBarnes
Copy link
Author

take

RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this issue Feb 23, 2023
…lowing

more control over handling of whitespace in fields and removing the
requirement to specify a `delimiter` in order to preserve whitespace. (pandas-dev#51569)

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this issue Feb 23, 2023
…andas-dev#51569)

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this issue Feb 23, 2023
…arguments

for `read_fwf`. (pandas-dev#51569)

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
@d3sbar
Copy link

d3sbar commented Mar 13, 2023

I am using read_fwf function for data processing and I ran into a different behavior where:

Without specifying delimiter or lineterminator keywords -- the function works if file has more than 1 line (except it fails to read last line in a file), but fails when only 1 line in the text file (by default file has no header row):
pd.read_fwf(my_file_path, widths= _widths)

Only when I specify delimiter AND lineterminator -- does the function work properly to read all lines in the file:
pd.read_fwf(cclf_path, widths= _widths, header=None, delimiter='?', lineterminator='\n', on_bad_lines='Bad Line Error')

Seems to me that the "required TextReader kwargs" for read_fwf should either be documented (easy path to avoid regression) or the defaults need to re-visited for the purpose of processing a "fixed width file" (by definition no delimiters, typically no headers and many times, conditional column count based on a prefix value on the line, etc).

@RonaldBarnes RonaldBarnes mentioned this issue Mar 16, 2023
3 tasks
@RonaldBarnes
Copy link
Author

Re-opening.

Still hoping to hear feedback on current behaviour & effectiveness of this patch in addressing it.

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants