diff --git a/doc/data/fx_prices b/doc/data/fx_prices
deleted file mode 100644
index 38cadf26909a3..0000000000000
Binary files a/doc/data/fx_prices and /dev/null differ
diff --git a/doc/data/mindex_ex.csv b/doc/data/mindex_ex.csv
deleted file mode 100644
index 935ff936cd842..0000000000000
--- a/doc/data/mindex_ex.csv
+++ /dev/null
@@ -1,16 +0,0 @@
-year,indiv,zit,xit
-1977,"A",1.2,.6
-1977,"B",1.5,.5
-1977,"C",1.7,.8
-1978,"A",.2,.06
-1978,"B",.7,.2
-1978,"C",.8,.3
-1978,"D",.9,.5
-1978,"E",1.4,.9
-1979,"C",.2,.15
-1979,"D",.14,.05
-1979,"E",.5,.15
-1979,"F",1.2,.5
-1979,"G",3.4,1.9
-1979,"H",5.4,2.7
-1979,"I",6.4,1.2
diff --git a/doc/data/test.xls b/doc/data/test.xls
deleted file mode 100644
index db0f9dec7d5e4..0000000000000
Binary files a/doc/data/test.xls and /dev/null differ
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index be761bb97f320..705861a3aa568 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -837,14 +837,11 @@ input text data into ``datetime`` objects.
 The simplest case is to just pass in ``parse_dates=True``:

 .. ipython:: python
-   :suppress:

     f = open("foo.csv", "w")
     f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
     f.close()

-.. ipython:: python
-
     # Use a column as an index, and parse it as dates.
     df = pd.read_csv("foo.csv", index_col=0, parse_dates=True)
     df
@@ -862,7 +859,6 @@ order) and the new column names will be the concatenation of the component
 column names:

 .. ipython:: python
-   :suppress:

     data = (
         "KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
@@ -876,9 +872,6 @@ column names:

     with open("tmp.csv", "w") as fh:
         fh.write(data)

-.. ipython:: python
-
-    print(open("tmp.csv").read())
     df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]])
     df

@@ -1058,19 +1051,20 @@ While US date formats tend to be MM/DD/YYYY, many international formats use
 DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided:

 .. ipython:: python
-   :suppress:

     data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c"
+    print(data)
     with open("tmp.csv", "w") as fh:
         fh.write(data)

-.. ipython:: python
-
-    print(open("tmp.csv").read())
-
     pd.read_csv("tmp.csv", parse_dates=[0])
     pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0])

+.. ipython:: python
+   :suppress:
+
+    os.remove("tmp.csv")
+
 Writing CSVs to binary file objects
 +++++++++++++++++++++++++++++++++++

@@ -1133,8 +1127,9 @@ For large numbers that have been written with a thousands separator, you can
 set the ``thousands`` keyword to a string of length 1 so that integers will be parsed
 correctly:

+By default, numbers with a thousands separator will be parsed as strings:
+
 .. ipython:: python
-   :suppress:

     data = (
         "ID|level|category\n"
@@ -1146,11 +1141,6 @@ correctly:

     with open("tmp.csv", "w") as fh:
         fh.write(data)

-By default, numbers with a thousands separator will be parsed as strings:
-
-.. ipython:: python
-
-    print(open("tmp.csv").read())
     df = pd.read_csv("tmp.csv", sep="|")
     df

@@ -1160,7 +1150,6 @@ The ``thousands`` keyword allows integers to be parsed correctly:

 .. ipython:: python

-    print(open("tmp.csv").read())
     df = pd.read_csv("tmp.csv", sep="|", thousands=",")
     df
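
To make the effect of ``thousands`` concrete, a minimal sketch (it assumes
nothing beyond the sample ``ID|level|category`` data shown above):

.. code-block:: python

   from io import StringIO

   import pandas as pd

   data = "ID|level|category\nPatient1|123,000|x\nPatient2|23,000|y\nPatient3|1,234,018|z"

   # Without thousands=, the commas keep "level" as strings (object dtype).
   print(pd.read_csv(StringIO(data), sep="|")["level"].dtype)

   # With thousands=",", the same column parses to integers (int64).
   print(pd.read_csv(StringIO(data), sep="|", thousands=",")["level"].dtype)
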
@@ -1239,16 +1228,13 @@ as a ``Series``:
    ``read_csv`` instead.

 .. ipython:: python
-   :suppress:
+   :okwarning:

     data = "level\nPatient1,123000\nPatient2,23000\nPatient3,1234018"

     with open("tmp.csv", "w") as fh:
         fh.write(data)

-.. ipython:: python
-   :okwarning:
-
-    print(open("tmp.csv").read())

     output = pd.read_csv("tmp.csv", squeeze=True)
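
The ``:okwarning:`` flag is needed here because ``squeeze=True`` emits a
``FutureWarning``: the keyword was deprecated in pandas 1.4. A sketch of the
recommended spelling instead of the deprecated keyword (reusing the
``tmp.csv`` written above):

.. code-block:: python

   # Read normally, then collapse the single-column DataFrame to a Series.
   output = pd.read_csv("tmp.csv").squeeze("columns")
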
@@ -1365,15 +1351,11 @@ The ``dialect`` keyword gives greater flexibility in specifying
 the file format. By default it uses the Excel dialect but you can specify
 either the dialect name or a :class:`python:csv.Dialect` instance.

-.. ipython:: python
-   :suppress:
-
-    data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
-
 Suppose you had data with unenclosed quotes:

 .. ipython:: python

+    data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f"
     print(data)

 By default, ``read_csv`` uses the Excel dialect and treats the double quote as
@@ -1449,8 +1431,9 @@ a different usage of the ``delimiter`` parameter:
   Can be used to specify the filler character of the fields
   if it is not spaces (e.g., '~').

+Consider a typical fixed-width data file:
+
 .. ipython:: python
-   :suppress:

     f = open("bar.csv", "w")
     data1 = (
@@ -1463,12 +1446,6 @@ a different usage of the ``delimiter`` parameter:
     f.write(data1)
     f.close()

-Consider a typical fixed-width data file:
-
-.. ipython:: python
-
-    print(open("bar.csv").read())
-
 In order to parse this file into a ``DataFrame``, we simply need to supply the
 column specifications to the ``read_fwf`` function along with the file name:

@@ -1523,19 +1500,15 @@ Indexes
 Files with an "implicit" index column
 +++++++++++++++++++++++++++++++++++++

-.. ipython:: python
-   :suppress:
-
-    f = open("foo.csv", "w")
-    f.write("A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5")
-    f.close()
-
 Consider a file with one less entry in the header than the number of data
 column:

 .. ipython:: python

-    print(open("foo.csv").read())
+    data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5"
+    print(data)
+    with open("foo.csv", "w") as f:
+        f.write(data)

 In this special case, ``read_csv`` assumes that the first column is to be used
 as the index of the ``DataFrame``:
@@ -1567,7 +1540,10 @@ Suppose you have data indexed by two columns:

 .. ipython:: python

-    print(open("data/mindex_ex.csv").read())
+    data = 'year,indiv,zit,xit\n1977,"A",1.2,.6\n1977,"B",1.5,.5'
+    print(data)
+    with open("mindex_ex.csv", mode="w") as f:
+        f.write(data)

 The ``index_col`` argument to ``read_csv`` can take a list of column numbers
 to turn multiple columns into a ``MultiIndex`` for the index of the
@@ -1575,9 +1551,14 @@ returned object:

 .. ipython:: python

-    df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
+    df = pd.read_csv("mindex_ex.csv", index_col=[0, 1])
     df
-    df.loc[1978]
+    df.loc[1977]
+
+.. ipython:: python
+   :suppress:
+
+    os.remove("mindex_ex.csv")

 .. _io.multi_index_columns:

@@ -1601,16 +1582,12 @@ rows will skip the intervening rows.
    of multi-columns indices.

 .. ipython:: python
-   :suppress:

     data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12"
-    fh = open("mi2.csv", "w")
-    fh.write(data)
-    fh.close()
-
-.. ipython:: python
+    print(data)
+    with open("mi2.csv", "w") as fh:
+        fh.write(data)

-    print(open("mi2.csv").read())
     pd.read_csv("mi2.csv", header=[0, 1], index_col=0)

 Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
@@ -1632,16 +1609,16 @@ comma-separated) files, as pandas uses the :class:`python:csv.Sniffer`
 class of the csv module. For this, you have to specify ``sep=None``.

 .. ipython:: python
-   :suppress:

     df = pd.DataFrame(np.random.randn(10, 4))
-    df.to_csv("tmp.sv", sep="|")
-    df.to_csv("tmp2.sv", sep=":")
+    df.to_csv("tmp.csv", sep="|")
+    df.to_csv("tmp2.csv", sep=":")
+    pd.read_csv("tmp2.csv", sep=None, engine="python")

 .. ipython:: python
+   :suppress:

-    print(open("tmp2.sv").read())
-    pd.read_csv("tmp2.sv", sep=None, engine="python")
+    os.remove("tmp2.csv")

 .. _io.multiple_files:

@@ -1662,8 +1639,9 @@ rather than reading the entire file into memory, such as the following:

 .. ipython:: python

-    print(open("tmp.sv").read())
-    table = pd.read_csv("tmp.sv", sep="|")
+    df = pd.DataFrame(np.random.randn(10, 4))
+    df.to_csv("tmp.csv", sep="|")
+    table = pd.read_csv("tmp.csv", sep="|")

     table

@@ -1672,7 +1650,7 @@ value will be an iterable object of type ``TextFileReader``:

 .. ipython:: python

-    with pd.read_csv("tmp.sv", sep="|", chunksize=4) as reader:
+    with pd.read_csv("tmp.csv", sep="|", chunksize=4) as reader:
         reader
         for chunk in reader:
             print(chunk)

@@ -1685,14 +1663,13 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:

 .. ipython:: python

-    with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
+    with pd.read_csv("tmp.csv", sep="|", iterator=True) as reader:
         reader.get_chunk(5)
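
As a sketch of why chunked reading helps, an aggregation that never holds more
than one chunk in memory (it reuses the ``tmp.csv`` written above; the ``"0"``
column label comes from the default integer column names written by ``to_csv``):

.. code-block:: python

   total = 0.0
   with pd.read_csv("tmp.csv", sep="|", chunksize=4) as reader:
       for chunk in reader:
           total += chunk["0"].sum()
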
 .. ipython:: python
    :suppress:

-    os.remove("tmp.sv")
-    os.remove("tmp2.sv")
+    os.remove("tmp.csv")

 Specifying the parser engine
 ''''''''''''''''''''''''''''

@@ -2594,27 +2571,38 @@ Read in the content of the file from the above URL and pass it to ``read_html``
 as a string:

 .. ipython:: python
-   :suppress:

-    rel_path = os.path.join("..", "pandas", "tests", "io", "data", "html",
-                            "banklist.html")
-    file_path = os.path.abspath(rel_path)
+    html_str = """
+            <table>
+                <tr>
+                    <th>A</th>
+                    <th>B</th>
+                    <th>C</th>
+                </tr>
+                <tr>
+                    <td>a</td>
+                    <td>b</td>
+                    <td>c</td>
+                </tr>
+            </table>
+    """
+
+    with open("tmp.html", "w") as f:
+        f.write(html_str)
+    df = pd.read_html("tmp.html")
+    df[0]

 .. ipython:: python
+   :suppress:

-    with open(file_path, "r") as f:
-        dfs = pd.read_html(f.read())
-    dfs
+    os.remove("tmp.html")

 You can even pass in an instance of ``StringIO`` if you so desire:

 .. ipython:: python

-    with open(file_path, "r") as f:
-        sio = StringIO(f.read())
-
-    dfs = pd.read_html(sio)
-    dfs
+    dfs = pd.read_html(StringIO(html_str))
+    dfs[0]

 .. note::

@@ -2748,77 +2736,48 @@ in the method ``to_string`` described above.
    brevity's sake. See :func:`~pandas.core.frame.DataFrame.to_html` for the
    full set of options.

-.. ipython:: python
-   :suppress:
+.. note::

-    def write_html(df, filename, *args, **kwargs):
-        static = os.path.abspath(os.path.join("source", "_static"))
-        with open(os.path.join(static, filename + ".html"), "w") as f:
-            df.to_html(f, *args, **kwargs)
+   In an environment that supports HTML rendering, such as a Jupyter Notebook,
+   ``display(HTML(...))`` will render the raw HTML.

 .. ipython:: python

+    from IPython.display import display, HTML
+
     df = pd.DataFrame(np.random.randn(2, 2))
     df
-    print(df.to_html())  # raw html
-
-.. ipython:: python
-   :suppress:
-
-    write_html(df, "basic")
-
-HTML:
-
-.. raw:: html
-   :file: ../_static/basic.html
+    html = df.to_html()
+    print(html)  # raw html
+    display(HTML(html))

 The ``columns`` argument will limit the columns shown:

 .. ipython:: python

-    print(df.to_html(columns=[0]))
-
-.. ipython:: python
-   :suppress:
-
-    write_html(df, "columns", columns=[0])
-
-HTML:
-
-.. raw:: html
-   :file: ../_static/columns.html
+    html = df.to_html(columns=[0])
+    print(html)
+    display(HTML(html))

 ``float_format`` takes a Python callable to control the precision of floating
 point values:

 .. ipython:: python

-    print(df.to_html(float_format="{0:.10f}".format))
-
-.. ipython:: python
-   :suppress:
-
-    write_html(df, "float_format", float_format="{0:.10f}".format)
+    html = df.to_html(float_format="{0:.10f}".format)
+    print(html)
+    display(HTML(html))

-HTML:
-
-.. raw:: html
-   :file: ../_static/float_format.html

 ``bold_rows`` will make the row labels bold by default, but you can turn that
 off:

 .. ipython:: python

-    print(df.to_html(bold_rows=False))
-
-.. ipython:: python
-   :suppress:
-
-    write_html(df, "nobold", bold_rows=False)
+    html = df.to_html(bold_rows=False)
+    print(html)
+    display(HTML(html))

-.. raw:: html
-   :file: ../_static/nobold.html

 The ``classes`` argument provides the ability to give the resulting HTML
 table CSS classes. Note that these classes are *appended* to the existing

@@ -2839,17 +2798,9 @@ that contain URLs.
             "url": ["https://www.python.org/", "https://pandas.pydata.org"],
         }
     )
-    print(url_df.to_html(render_links=True))
-
-.. ipython:: python
-   :suppress:
-
-    write_html(url_df, "render_links", render_links=True)
-
-HTML:
-
-.. raw:: html
-   :file: ../_static/render_links.html
+    html = url_df.to_html(render_links=True)
+    print(html)
+    display(HTML(html))

 Finally, the ``escape`` argument allows you to control whether the
 "<", ">" and "&" characters escaped in the resulting HTML (by default it is

@@ -2859,30 +2810,21 @@ Finally, the ``escape`` argument allows you to control whether the

     df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})

-
-.. ipython:: python
-   :suppress:
-
-    write_html(df, "escape")
-    write_html(df, "noescape", escape=False)
-
 Escaped:

 .. ipython:: python

-    print(df.to_html())
-
-.. raw:: html
-   :file: ../_static/escape.html
+    html = df.to_html()
+    print(html)
+    display(HTML(html))

 Not escaped:

 .. ipython:: python

-    print(df.to_html(escape=False))
-
-.. raw:: html
-   :file: ../_static/noescape.html
+    html = df.to_html(escape=False)
+    print(html)
+    display(HTML(html))
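
The ``classes`` argument mentioned above is not demonstrated in this part of
the guide; a minimal sketch (the class name is arbitrary):

.. code-block:: python

   # The given classes are appended to the default "dataframe" class,
   # yielding e.g. <table border="1" class="dataframe my-table">.
   print(df.to_html(classes=["my-table"]))
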
 .. note::

@@ -3062,13 +3004,10 @@ Read in the content of the "books.xml" file and pass it to ``read_xml``
 as a string:

 .. ipython:: python
-   :suppress:

-    rel_path = os.path.join("..", "pandas", "tests", "io", "data", "xml",
-                            "books.xml")
-    file_path = os.path.abspath(rel_path)
-
-.. ipython:: python
+    file_path = "books.xml"
+    with open(file_path, "w") as f:
+        f.write(xml)

     with open(file_path, "r") as f:
         df = pd.read_xml(f.read())
     df

@@ -3128,6 +3067,11 @@ Specify only elements or only attributes to parse:

     df = pd.read_xml(file_path, attrs_only=True)
     df

+.. ipython:: python
+   :suppress:
+
+    os.remove("books.xml")
+
 XML documents can have namespaces with prefixes and default namespaces without
 prefixes both of which are denoted with a special attribute ``xmlns``. In order
 to parse by node under a namespace context, ``xpath`` must reference a prefix.

@@ -5672,7 +5616,6 @@ the database using :func:`~pandas.DataFrame.to_sql`.

 .. ipython:: python
-   :suppress:

     import datetime

@@ -5685,10 +5628,8 @@ the database using :func:`~pandas.DataFrame.to_sql`.

     data = pd.DataFrame(d, columns=c)

-.. ipython:: python
-
-    data
-    data.to_sql("data", engine)
+    data
+    data.to_sql("data", engine)

 With some databases, writing large DataFrames can result in errors due to
 packet size limitations being exceeded. This can be avoided by setting the
diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
index 71aef4fdd75f6..a8591c5d3a2c7 100644
--- a/doc/source/user_guide/scale.rst
+++ b/doc/source/user_guide/scale.rst
@@ -82,6 +82,13 @@ Option 2 only loads the columns we request.

     pd.read_parquet("timeseries_wide.parquet", columns=columns)

+.. ipython:: python
+   :suppress:
+
+    import os
+
+    os.remove("timeseries_wide.parquet")
+
 If we were to measure the memory usage of the two calls, we'd see that specifying
 ``columns`` uses about 1/10th the memory in this case.

@@ -102,6 +109,11 @@ can store larger datasets in memory.

     ts = pd.read_parquet("timeseries.parquet")
     ts

+.. ipython:: python
+   :suppress:
+
+    os.remove("timeseries.parquet")
+
 Now, let's inspect the data types and memory usage to see where we should
 focus our attention.

@@ -364,6 +376,13 @@ out of memory. At that point it's just a regular pandas object.

     @savefig dask_resample.png
     ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()

+.. ipython:: python
+   :suppress:
+
+    import shutil
+
+    shutil.rmtree("data/timeseries")
+
 These Dask examples have all be done using multiple processes on a single
 machine. Dask can be `deployed on a cluster
 <https://docs.dask.org/en/latest/setup.html>`_ to scale up to even larger
 datasets.
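
One way to check the 1/10th-memory claim above, as a hypothetical sketch (it
assumes the ``timeseries_wide.parquet`` file still exists and reuses the
``columns`` list from earlier in the guide):

.. code-block:: python

   import pandas as pd

   columns = ["id_0", "name_0", "x_0", "y_0"]  # subset used earlier in the guide

   full = pd.read_parquet("timeseries_wide.parquet")
   subset = pd.read_parquet("timeseries_wide.parquet", columns=columns)

   # Compare the bytes actually held in memory by each result.
   print(full.memory_usage(deep=True).sum())
   print(subset.memory_usage(deep=True).sum())
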
diff --git a/doc/source/whatsnew/v0.9.1.rst b/doc/source/whatsnew/v0.9.1.rst
index 6b05e5bcded7e..ce89b47e35da0 100644
--- a/doc/source/whatsnew/v0.9.1.rst
+++ b/doc/source/whatsnew/v0.9.1.rst
@@ -95,11 +95,12 @@ New features

 - Enable referencing of Excel columns by their column names (:issue:`1936`)

-  .. ipython:: python
+  .. code-block:: ipython
+
+     In [1]: xl = pd.ExcelFile('data/test.xls')

-     xl = pd.ExcelFile('data/test.xls')
-     xl.parse('Sheet1', index_col=0, parse_dates=True,
-              parse_cols='A:D')
+     In [2]: xl.parse('Sheet1', index_col=0, parse_dates=True,
+                      parse_cols='A:D')

 - Added option to disable pandas-style tick locators and formatters
diff --git a/pandas/tests/util/test_show_versions.py b/pandas/tests/util/test_show_versions.py
index 4a962520460b0..53521cda5d271 100644
--- a/pandas/tests/util/test_show_versions.py
+++ b/pandas/tests/util/test_show_versions.py
@@ -89,7 +89,9 @@ def test_show_versions_console(capsys):

     # check required dependency
     # 2020-12-09 npdev has "dirty" in the tag
-    assert re.search(r"numpy\s*:\s([0-9\.\+a-g\_]|dev)+(dirty)?\n", result)
+    # 2022-05-25 npdev released with an RC and without "dirty".
+    # Just ensure we match [0-9]+\..*, since the npdev version string is variable.
+    assert re.search(r"numpy\s*:\s[0-9]+\..*\n", result)

     # check optional dependency
     assert re.search(r"pyarrow\s*:\s([0-9\.]+|None)\n", result)
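
A quick check of the loosened pattern, showing it accepts both a release build
and a dev build (the version strings below are made up for illustration):

.. code-block:: python

   import re

   pattern = r"numpy\s*:\s[0-9]+\..*\n"
   assert re.search(pattern, "numpy           : 1.22.4\n")
   assert re.search(pattern, "numpy           : 1.23.0.dev0+1219.gb6b9b8b\n")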