REF/BUG/TYP: read_csv shouldn't close user-provided file handles #36997

twoertwein · 2020-10-09T01:04:08Z

- closes BUG: Pandas closes user-provided file handles that it doesn't own. #36980
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

REF/BUG: de-duplicate all file handling in TextReader by calling get_handle in CParserWrapper. ~~When TextReader gets a string it uses memory mapping (it is given a file object in all other cases).~~

REF/TYP: The second commit adds a new return value to get_handle: whether the buffer is wrapped inside a TextIOWrapper. In that case we cannot close it; we need to detach it (and flush it if we wrote to it). I made get_handle return a typed dataclass HandleArgs and made sure that all created handles are in HandleArgs.created_handles, so there is no need to close HandleArgs.handle separately (unless it is created by get_filepath_or_buffer).
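A minimal sketch of the idea described above, using only the standard library. The class name HandleBundle and the helper wrap_user_buffer are hypothetical and are not the pandas implementation; the fields mirror the created_handles and is_wrapped attributes mentioned in this PR. It shows why a TextIOWrapper around a user-provided buffer is detached rather than closed.

```python
import dataclasses
import io
from typing import IO, Any, List


@dataclasses.dataclass
class HandleBundle:
    """Hypothetical stand-in for the dataclass discussed in this PR."""

    handle: IO[Any]
    created_handles: List[IO[Any]] = dataclasses.field(default_factory=list)
    is_wrapped: bool = False

    def close(self) -> None:
        if self.is_wrapped:
            # Closing a TextIOWrapper would also close the wrapped buffer,
            # so flush pending writes and detach instead.
            assert isinstance(self.handle, io.TextIOWrapper)
            self.handle.flush()
            self.handle.detach()
            self.created_handles.remove(self.handle)
        # Close only the handles that were created here, never the user's.
        for created in self.created_handles:
            created.close()
        self.created_handles = []
        self.is_wrapped = False


def wrap_user_buffer(buffer: io.BytesIO, encoding: str = "utf-8") -> HandleBundle:
    """Hypothetical helper: put a text layer on top of a user-owned bytes buffer."""
    wrapper = io.TextIOWrapper(buffer, encoding=encoding)
    return HandleBundle(handle=wrapper, created_handles=[wrapper], is_wrapped=True)


user_buffer = io.BytesIO(b"a,b\n1,2\n")
bundle = wrap_user_buffer(user_buffer)
first_line = bundle.handle.readline()
bundle.close()
# The caller still owns an open buffer after the bundle is closed.
still_open = not user_buffer.closed
```

The design choice here is that the bundle tracks everything it created, so callers need a single close() call and never have to reason about which layer owns which handle.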

I used asserts for mypy when I'm 100% certain about the type, otherwise I added mypy ignore statements.

In the future it might be good to merge get_handle and get_filepath_or_buffer.

pandas/tests/io/parser/test_encoding.py

pandas/_libs/parsers.pyx

pandas/io/parsers.py

jreback · 2020-10-10T16:00:09Z

cc @gfyoung

pandas/io/parsers.py

pandas/tests/io/parser/test_common.py

twoertwein · 2020-10-12T15:52:30Z

The failure on Windows when reading ./pandas/tests/io/sas/data/test_sas7bdat_2.csv occurred when encoding was None. TextIOWrapper on Windows seems to default to charmap instead of utf-8.
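A small sketch of the workaround this implies, assuming the fix is to pass an explicit encoding instead of None. The helper resolve_encoding is hypothetical and only illustrates the fallback; when no encoding is given, TextIOWrapper typically uses the locale's preferred encoding, which is platform dependent.

```python
import io
import locale
from typing import Optional


def resolve_encoding(encoding: Optional[str]) -> str:
    """Hypothetical helper: fall back to UTF-8 rather than the platform default."""
    return "utf-8" if encoding is None else encoding


raw = "é,ü\n1,2\n".encode("utf-8")

# Usually what TextIOWrapper picks when encoding is None; differs by platform.
platform_default = locale.getpreferredencoding(False)

# With an explicit encoding the result no longer depends on the platform.
reader = io.TextIOWrapper(io.BytesIO(raw), encoding=resolve_encoding(None))
header = reader.readline()
```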

pandas/_libs/parsers.pyx

pandas/io/parsers.py

pep8speaks · 2020-10-16T21:42:41Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-04 02:11:22 UTC

jbrockmendel · 2020-10-28T22:35:16Z

@twoertwein can you merge master and I'll take another look

twoertwein · 2020-10-28T22:46:07Z

@jbrockmendel rebased. Sorry for the large diff. Most code changes are from changing the return value of get_handle.

pandas/_typing.py

jreback

looks good, a bunch of small comments.

I think if you can make IOHandleArgs fully functional with methods then the io routines become simpler

pandas/_libs/parsers.pyx

pandas/_typing.py

pandas/io/common.py

pandas/io/formats/csvs.py

pandas/io/json/_json.py

pandas/io/parsers.py

jreback

looks really good @twoertwein if you'd merge master and ping on green.

@pandas-dev/pandas-core if any comments

pandas/_typing.py

pandas/io/excel/_base.py

pandas/io/formats/csvs.py

jreback · 2020-11-02T00:55:04Z

pandas/io/json/_json.py

+            mode="wt",
+            storage_options=storage_options,
+        )
+        handle_args = get_handle(


same here, add some comments to indicate the differences between the ioargs and handle_args

maybe think about having a handle arg in IOArgs (that you get by calling ioargs.get_handle()), but this might be too complicated / nested.

combining get_filepath_or_buffer (mostly used to open URLs) and get_handle (compression, opening files, wrapping bytes, and memory mapping) in some way would make a lot of sense. I feel that calling get_filepath_or_buffer inside get_handle would be a good solution. But I would need to first re-visit all places that call get_filepath_or_buffer and get_handle to make sure that this satisfies everyone.

@jreback Do you prefer to have this in this PR or in a followup? If adding more changes to this PR is okay from a review perspective, I'm tempted to add it to this PR (instead of touching IOArgs twice). I'll probably have time to combine them by the end of next weekend. Is there a deadline/feature freeze for 1.2 upcoming?

I think a follow-on would be easier to review

1.2 is scheduled for the end of Nov, so we have a little time
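For illustration only, a rough sketch of what a single combined entry point could look like, written with the standard library. The function open_handle, its parameters, and the returned tuple are all hypothetical and do not reflect the actual signatures of get_filepath_or_buffer or get_handle; URL and fsspec handling are left out entirely.

```python
import gzip
import io
import os
import tempfile
from typing import IO, Any, List, Optional, Tuple, Union

PathOrBuffer = Union[str, os.PathLike, IO[Any]]


def open_handle(
    path_or_buffer: PathOrBuffer,
    mode: str = "r",
    compression: Optional[str] = None,
    encoding: str = "utf-8",
) -> Tuple[IO[Any], List[IO[Any]]]:
    """Hypothetical combined helper; returns (handle, handles created here)."""
    created: List[IO[Any]] = []
    handle: Any = path_or_buffer
    binary_mode = mode.replace("t", "").replace("b", "") + "b"

    # Step 1 (the role played by get_filepath_or_buffer): resolve a path.
    if isinstance(path_or_buffer, (str, os.PathLike)):
        handle = open(path_or_buffer, binary_mode)
        created.append(handle)

    # Step 2 (the role played by get_handle): decompress, then wrap bytes.
    if compression == "gzip":
        handle = gzip.GzipFile(fileobj=handle, mode=binary_mode)
        created.append(handle)
    if "b" not in mode and not isinstance(handle, io.TextIOBase):
        handle = io.TextIOWrapper(handle, encoding=encoding)
        created.append(handle)
    return handle, created


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")

    handle, created = open_handle(path, compression="gzip")
    text = handle.read()
    # Close in reverse order of creation: text layer, gzip layer, raw file.
    for layer in reversed(created):
        layer.close()
    all_closed = all(layer.closed for layer in created)
```

Here every layer was created by the helper, so closing all of them is safe. For a user-provided buffer the text layer would need to be detached rather than closed.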

jreback · 2020-11-02T00:57:53Z

pandas/io/stata.py

-
-        if isinstance(path_or_buf, (str, bytes)):
-            self.path_or_buf = open(path_or_buf, "rb")
+        self.ioargs = get_filepath_or_buffer(


cc @bashtage if any comments here

this change is more for convenience: get_filepath_or_buffer performs meaningful operations only on strings/paths (opening fsspec resources), but in all cases it creates an IOArgs. Without this, we would need to instantiate an IOArgs ourselves if path_or_buf is not a string.

pandas/tests/frame/methods/test_to_csv.py

pandas/io/json/_json.py

jorisvandenbossche · 2020-11-03T20:53:49Z

pandas/_typing.py

+    created_handles: List[Buffer] = dataclasses.field(default_factory=list)
+    is_wrapped: bool = False
+
+    def close(self) -> None:


If there is actual functionality in here, shouldn't we rather move this to io/common.py (or somewhere in the io module) ?
I would expect to have pure typing-related things in this file.

okay, I will move it to io/common. Do you think it is preferable to then import IOHandles (and IOArgs) in _typing so that other modules can import IOHandles from _typing (there are only a few places that will need to import them) or should they directly import it from io/common?
@jorisvandenbossche @jbrockmendel

yes ok to move it to io/common.py


jreback

very small comments, ping on green.

pandas/_typing.py

pandas/io/json/_json.py

pandas/io/parsers.py

jreback · 2020-11-04T00:07:28Z

pandas/io/parsers.py

@@ -1403,8 +1398,7 @@ def _validate_parse_dates_presence(self, columns: List[str]) -> None:
            )

    def close(self):
-        for f in self.handles:
-            f.close()
+        self.handles.close()


does this not need self.ioargs.close()?

_read calls get_filepath_or_buffer and closes the handle afterwards itself. It then passes down the file handle to TextFileReader and ParserBase.
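The ownership rule described here (the layer that opens a handle closes it, and lower layers only use it) can be sketched as follows. The functions read_first_line and _parse are hypothetical and only loosely mirror the split between _read and the parser classes.

```python
import io
import os
import tempfile
from typing import IO, Union


def _parse(handle: IO[str]) -> str:
    """Inner layer: uses the handle but never closes it."""
    return handle.readline()


def read_first_line(path_or_buffer: Union[str, IO[str]]) -> str:
    """Hypothetical outer layer: closes only the handle it opened itself."""
    opened_here = isinstance(path_or_buffer, str)
    if opened_here:
        handle = open(path_or_buffer, encoding="utf-8")
    else:
        handle = path_or_buffer
    try:
        return _parse(handle)
    finally:
        if opened_here:
            handle.close()


# A user-provided buffer stays open and usable after the call.
user_handle = io.StringIO("a,b\n1,2\n")
from_buffer = read_first_line(user_handle)
next_line = user_handle.readline()

# A path is opened and closed inside the function.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("x,y\n3,4\n")
    from_path = read_first_line(path)
```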

jreback · 2020-11-04T00:08:38Z

pandas/io/parsers.py


-        # close additional handles opened by C parser (for compression)
+        # close additional handles opened by C parser (for memory_map)


in theory you could add this to ioargs, right? (and then add the try/except on the ioargs close); certainly it's fine here unless that suggestion is simpler

I don't know, I'm not familiar enough with resources in the C/Cython part. I would prefer if this close call were made by the C engine itself (or its destructor). I honestly don't like that resources are closed by a different class/function than the one that created them.
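As a pure-Python illustration of that preference (not the C engine, whose internals are not shown in this thread), a class that creates a memory map can also be the one that closes it. The class MappedReader is hypothetical.

```python
import mmap
import os
import tempfile


class MappedReader:
    """Hypothetical reader that owns, and therefore closes, its file and mmap."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        try:
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        except Exception:
            # Do not leak the file if mapping fails (e.g. for an empty file).
            self._file.close()
            raise

    def read(self) -> bytes:
        return self._map[:]

    def close(self) -> None:
        # Both resources were created in __init__, so they are released here.
        self._map.close()
        self._file.close()

    def __enter__(self) -> "MappedReader":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n")
    with MappedReader(path) as reader:
        content = reader.read()
    map_closed = reader._map.closed
    file_closed = reader._file.closed
```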

pandas/io/pickle.py

jreback · 2020-11-04T00:10:26Z

pandas/_typing.py:1:1: F401 'dataclasses' imported but unused
pandas/_typing.py:6:1: F401 'typing.Generic' imported but unused

from pre-commit checks

jreback · 2020-11-04T02:11:28Z

lgtm ping on green.

twoertwein · 2020-11-04T03:30:54Z

@jreback green'ish

jreback · 2020-11-04T03:38:42Z

i restarted the pre commit

twoertwein · 2020-11-04T03:46:31Z

@jreback pre-commit is green

jreback · 2020-11-04T04:09:34Z

thanks @twoertwein really nice!


jbrockmendel reviewed Oct 9, 2020

pandas/tests/io/parser/test_encoding.py (outdated)

jbrockmendel reviewed Oct 9, 2020

pandas/_libs/parsers.pyx

jreback requested changes Oct 10, 2020

pandas/io/parsers.py (outdated)

pandas/io/parsers.py (outdated)

pandas/io/parsers.py (outdated)

pandas/io/parsers.py (outdated)

jreback added the IO CSV read_csv, to_csv label Oct 10, 2020

gfyoung reviewed Oct 11, 2020

pandas/io/parsers.py (outdated)

gfyoung reviewed Oct 11, 2020

pandas/io/parsers.py (outdated)

gfyoung reviewed Oct 11, 2020

pandas/tests/io/parser/test_common.py

twoertwein mentioned this pull request Oct 14, 2020

ENH: .read_pickle(...) from zip containing hidden OS X/macOS metadata files/folders #37101

Closed

5 tasks

twoertwein marked this pull request as ready for review October 16, 2020 05:54

twoertwein commented Oct 16, 2020

pandas/_libs/parsers.pyx (outdated)

twoertwein commented Oct 16, 2020

pandas/io/parsers.py (outdated)

twoertwein changed the title ~~BUG/REF: read_csv shouldn't close user-provided file handles~~ REF/BUG/TYP: read_csv shouldn't close user-provided file handles Oct 25, 2020

twoertwein requested a review from jreback October 25, 2020 23:59

twoertwein commented Oct 28, 2020

pandas/_typing.py

jreback requested changes Oct 31, 2020

jreback requested changes Nov 2, 2020

jreback added this to the 1.2 milestone Nov 2, 2020

twoertwein commented Nov 2, 2020

pandas/io/json/_json.py (outdated)

jorisvandenbossche reviewed Nov 3, 2020

twoertwein added 6 commits November 3, 2020 18:54

BUG/REF: read_csv shouldn't close user-provided file handles

618b6eb

get_handle: typing, returns is_wrapped, use dataclass, and make sure that all created handlers are returned

4f26aea

remove unused imports

443a91e

added IOHandleArgs.close

6a10513

added IOArgs.close

60fc0a8

mostly comments

e65c4d9

twoertwein added 2 commits November 3, 2020 18:54

move memory_map from TextReader to CParserWrapper

1378221

moved IOArgs and IOHandles

74c6872

jreback requested changes Nov 4, 2020


twoertwein and others added 2 commits November 3, 2020 20:26

more comments

4dc58a6

Merge branch 'master' into read_csv

8ce25d1

jreback approved these changes Nov 4, 2020


jreback merged commit a648fb2 into pandas-dev:master Nov 4, 2020

twoertwein deleted the read_csv branch November 4, 2020 04:13

phofl mentioned this pull request Jan 6, 2021

BUG: read_csv raising when null bytes are in skipped rows #38989

Closed

3 tasks

simonjayhawkins mentioned this pull request Jan 18, 2021

BUG: V1.2 DataFrame.to_csv() fails to write a file with codecs #39247

Closed

3 tasks

This was referenced Mar 10, 2021

CLN: remove unused c_encoding #40342

Merged

CLN: remove unused file opening and mmap code from parsers.pyx #40431

Merged

simonjayhawkins mentioned this pull request May 24, 2021

BUG: read_csv is failing with an encoding different that UTF-8 and memory_map set to True in version 1.2.4 #40986

Closed

3 tasks
