BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

RonaldBarnes · 2023-02-22T23:10:47Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

data = """
CompanyName Alice        7500.00 5  1 3 5 D F 
OrganizationBob          8790.00 6 1 4 A 9 BC 
"""

df = pd.read_fwf(StringIO(data),
  widths=[12,12,8,2,12],
  header=None,
  names=["Company", "Contact", "Pay_sum", "Pay_count", "Credit_score"],
  )
print(df.values)

[['CompanyName' 'Alice' 7500.0 5 '1 3 5 D F']
 ['Organization' 'Bob' 8790.0 6 '1 4 A 9 BC']]

Issue Description

Note in code sample that the Credit_score has had leading (and trailing) spaces removed.
This is now irretrievably corrupted data.

Expected Behavior

Expected: The leading (at minimum) spaces are preserved, as they have significance.

The field width should match the widths or colspecs specified unless explicitly requesting otherwise.

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-32-generic
Version : #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : 0.29.33
pytest : 7.2.1
hypothesis : 6.65.1
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.8.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

The text was updated successfully, but these errors were encountered:

RonaldBarnes · 2023-02-22T23:30:57Z

Consider the following code:

import pandas as pd
from io import StringIO


data = ["left      ","  centre  ","     right"]

print(f"data: {data}")


## Preserves whitespace:
df = pd.DataFrame([data], columns=["One", "Two", "Three"])


## Preserves whitespace:
df_csv = pd.read_csv(
	StringIO(",".join(data)),
	header=None,
	names=["One", "Two", "Three"],
	)


## Corrupts whitespace:
df_fwf = pd.read_fwf(
	StringIO("".join(data)),
	header=None,
	widths=[10,10,10],
	names=["One", "Two", "Three"],
	)

print("DataFrame values:")
print(df.values)
print("read_csv values:")
print(df_csv.values)
print("read_fwf values:")
print(df_fwf.values)


data: ['left      ', '  centre  ', '     right']
DataFrame values:
[['left      ' '  centre  ' '     right']]
read_csv values:
[['left      ' '  centre  ' '     right']]
read_fwf values:
[['left' 'centre' 'right']]

The one dataframe creation method where I explicitly specify the column widths is the only method that changes my data when reading it.

RonaldBarnes · 2023-02-22T23:52:23Z

Currently, there is a bit of a work-around for this issue, but it's quite an "anti-pattern", and very counter-intuitive.

Adding a delimiter argument to the read_fwf call produces different output, despite there being no delimiters in the fixed-width file (pretty much by definition):

Using the code snippet from previous comment, but adding a delimiter (must be something that does not exist at the boundaries of fields!):

df_fwf_delim = pd.read_fwf(
	StringIO("".join(data)),
	header=None,
	widths=[10,10,10],
	names=["One", "Two", "Three"],
	delimiter="?",
	)

print("read_fwf DELIMITER values:")
print(df_fwf_delim.values))

read_fwf DELIMITER values:
[['left      ' '  centre  ' '     right']]

The source of this behaviour is found at:

pandas/pandas/io/parsers/python_parser.py

Line 1182 in 3f3102b

self.delimiter = "\r\n" + delimiter if delimiter else "\n\r\t "

pandas/pandas/io/parsers/python_parser.py

Line 1281 in 3f3102b

return [line[from_:to].strip(self.delimiter) for (from_, to) in self.colspecs]

It seems that read_fwf has been made to work with tables to the detriment of actual fixed-width files (a format predating YAML, JSON, CSV, etc.).

A read_table already exists, so I question whether this behaviour belongs in this function.

RonaldBarnes · 2023-02-22T23:54:37Z

Final note (for now):

There is no mention of delimiters for read_fwf in the API reference at
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html

There is a confusing mention of them in the user_guide at
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-fwf-reader

delimiter: Characters to consider as filler characters in the fixed-width file.
Can be used to specify the filler character of the fields if it is not
spaces (e.g., ‘~’).

It can be argued that trailing spaces are "filler characters", but not leading
spaces.

Like in a Python script file, they have significance and should not be removed
unless explicitely requested.

RonaldBarnes · 2023-02-22T23:54:44Z

take

…lowing more control over handling of whitespace in fields and removing the requirement to specify a `delimiter` in order to preserve whitespace. (pandas-dev#51569) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

…andas-dev#51569) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

…arguments for `read_fwf`. (pandas-dev#51569) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

d3sbar · 2023-03-13T19:23:51Z

I am using read_fwf function for data processing and I ran into a different behavior where:

Without specifying delimiter or lineterminator keywords -- the function works if file has more than 1 line (except it fails to read last line in a file), but fails when only 1 line in the text file (by default file has no header row):
pd.read_fwf(my_file_path, widths= _widths)

Only when I specify delimiter AND lineterminator -- does the function work properly to read all lines in the file:
pd.read_fwf(cclf_path, widths= _widths, header=None, delimiter='?', lineterminator='\n', on_bad_lines='Bad Line Error')

Seems to me that the "required TextReader kwargs" for read_fwf should either be documented (easy path to avoid regression) or the defaults need to re-visited for the purpose of processing a "fixed width file" (by definition no delimiters, typically no headers and many times, conditional column count based on a prefix value on the line, etc).

RonaldBarnes · 2023-04-01T00:57:54Z

Re-opening.

Still hoping to hear feedback on current behaviour & effectiveness of this patch in addressing it.

RonaldBarnes added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 22, 2023

github-actions bot assigned RonaldBarnes Feb 23, 2023

RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this issue Feb 23, 2023

TST: Integration tests for keep_whitespace option of read_fwf. (p…

0006637

…andas-dev#51569) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

RonaldBarnes added a commit to RonaldBarnes/pandas that referenced this issue Feb 23, 2023

DOC: What's New for v2.0.0: keep_whitespace and whitespace_chars …

65d444d

…arguments for `read_fwf`. (pandas-dev#51569) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

RonaldBarnes mentioned this issue Feb 23, 2023

Add keep_whitespace and whitespace_chars to read_fwf #51577

Closed

5 tasks

simonjayhawkins added the IO Fixed Width read_fwf label Feb 23, 2023

RonaldBarnes mentioned this issue Mar 16, 2023

read_fwf docs #49832

Closed

3 tasks

mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

RonaldBarnes commented Feb 22, 2023 •

edited

Loading

INSTALLED VERSIONS

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

d3sbar commented Mar 13, 2023

RonaldBarnes commented Apr 1, 2023

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

Comments

RonaldBarnes commented Feb 22, 2023 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

RonaldBarnes commented Feb 22, 2023

d3sbar commented Mar 13, 2023

RonaldBarnes commented Apr 1, 2023

RonaldBarnes commented Feb 22, 2023 •

edited

Loading