From 60136dccb63e879bcf4c76d62332746655d93a41 Mon Sep 17 00:00:00 2001 From: Chris Carini <6374067+ChrisCarini@users.noreply.github.com> Date: Sat, 20 Feb 2021 19:06:51 -0800 Subject: [PATCH 01/40] Removing extra "<" from link in README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 81878e8e3..b1838fe88 100644 --- a/README.md +++ b/README.md @@ -263,7 +263,7 @@ profile.to_file("output.html") Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. -For that purpose, `pandas-profiling` integrates with [Great Expectations](https://www.greatexpectations.io>). +For that purpose, `pandas-profiling` integrates with [Great Expectations](https://www.greatexpectations.io). This is a world-class open-source library that helps you to maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). `pandas-profiling` features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store, and use to validate another (or future) dataset.
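The README passage in the patch above notes that validation rules are often defined in terms of well-known profiling statistics. A minimal, self-contained pandas sketch of that idea follows — the column name and data are illustrative, and `values_between` is a hypothetical stand-in for the Great Expectations check, not part of either library:

```python
import pandas as pd

df = pd.DataFrame({"passenger_count": [1, 2, 4, 6, 3]})

# A profiling pass records well-known statistics, such as the observed range...
observed_min = df["passenger_count"].min()
observed_max = df["passenger_count"].max()

# ...and a validation rule replays those statistics against new data, in the
# spirit of expect_column_values_to_be_between(min_value=1, max_value=6).
def values_between(series: pd.Series, min_value, max_value) -> bool:
    return bool(series.between(min_value, max_value).all())

assert values_between(pd.Series([2, 5, 3]), observed_min, observed_max)
assert not values_between(pd.Series([0, 7]), observed_min, observed_max)
```

Storing the profiled statistics and re-checking them on a later batch is exactly the workflow the integration automates.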
From 8b8d9b44b698fb4aee1684b01cfd78408161e4fc Mon Sep 17 00:00:00 2001 From: sbrugman Date: Sun, 21 Feb 2021 11:43:04 +0100 Subject: [PATCH 02/40] fix: github actions set env --- .github/workflows/ci.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 74e84c01d..48e1aa31b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -105,7 +105,7 @@ jobs: - name: Create branch name run: | export BRANCH_NAME="${GITHUB_REF##*/}/" - echo "name=BRANCH_NAME" >> $BRANCH_NAME + echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV - name: Move the changes to the gh-pages branch (release branch) run: | From a65dee400c6b4b9223e0ee4678ccbb1b0bce8fad Mon Sep 17 00:00:00 2001 From: sbrugman Date: Sun, 21 Feb 2021 11:54:14 +0100 Subject: [PATCH 03/40] docs: fix rst syntax and update module autodoc --- docsrc/source/pages/api/model.rst | 11 ++++++++++- docsrc/source/pages/changelog/v2_10_0.rst | 4 ++-- docsrc/source/pages/changelog/v2_11_0.rst | 2 +- docsrc/source/pages/changelog/v2_12_0.rst | 4 ++-- .../source/pages/great_expectations_integration.rst | 2 +- docsrc/source/pages/support.rst | 2 +- 6 files changed, 17 insertions(+), 8 deletions(-) diff --git a/docsrc/source/pages/api/model.rst b/docsrc/source/pages/api/model.rst index e95c6dfa7..31991bba6 100644 --- a/docsrc/source/pages/api/model.rst +++ b/docsrc/source/pages/api/model.rst @@ -8,8 +8,17 @@ Model .. 
autosummary:: :toctree: _autosummary - base describe + handler summary + summary_algorithms + summary_helpers + summary_helpers_image + summarizer + typeset + typeset_relations messages correlations + duplicates + sample + expectation_algorithms diff --git a/docsrc/source/pages/changelog/v2_10_0.rst b/docsrc/source/pages/changelog/v2_10_0.rst index 5b6ac3938..7d8ca99c0 100644 --- a/docsrc/source/pages/changelog/v2_10_0.rst +++ b/docsrc/source/pages/changelog/v2_10_0.rst @@ -1,5 +1,5 @@ Changelog v2.10.0 ----------------- +----------------- 🎉 Features ^^^^^^^^^^^ @@ -9,7 +9,7 @@ Changelog v2.10.0 - Restructure categorical variable overview 👷‍♂️ Internal Improvements -^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Full visions integration for type system: read more `here `_. - Migrate from Travis CI to Github Actions... diff --git a/docsrc/source/pages/changelog/v2_11_0.rst b/docsrc/source/pages/changelog/v2_11_0.rst index d4b7aa3e2..0c05e30c5 100644 --- a/docsrc/source/pages/changelog/v2_11_0.rst +++ b/docsrc/source/pages/changelog/v2_11_0.rst @@ -1,5 +1,5 @@ Changelog v2.11.0 ----------------- +----------------- 🎉 Features ^^^^^^^^^^^ diff --git a/docsrc/source/pages/changelog/v2_12_0.rst b/docsrc/source/pages/changelog/v2_12_0.rst index 6b287f91b..b57917207 100644 --- a/docsrc/source/pages/changelog/v2_12_0.rst +++ b/docsrc/source/pages/changelog/v2_12_0.rst @@ -1,5 +1,5 @@ Changelog v2.12.0 ----------------- +----------------- 🎉 Features ^^^^^^^^^^^ @@ -10,7 +10,7 @@ Changelog v2.12.0 - 👷‍♂️ Internal Improvements -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^^^^^^^ - 📖 Documentation diff --git a/docsrc/source/pages/great_expectations_integration.rst b/docsrc/source/pages/great_expectations_integration.rst index e21e3bf61..46373d0c9 100644 --- a/docsrc/source/pages/great_expectations_integration.rst +++ b/docsrc/source/pages/great_expectations_integration.rst @@ -12,7 +12,7 @@ About Great Expectations 
``expect_column_values_to_be_between(column="passenger_count", min_value=1, max_value=6)`` -Great Expectations then uses this statement to validate whether the column ``passenger_count`` in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides :ref:`several dozen highly expressive built-in Expectations`, and allows you to write custom Expectations. +Great Expectations then uses this statement to validate whether the column ``passenger_count`` in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides `several dozen highly expressive built-in Expectations `_, and allows you to write custom Expectations. Great Expectations renders Expectations to clean, human-readable documentation called *Data Docs*. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run – think of it as a continuously updated data quality report. diff --git a/docsrc/source/pages/support.rst b/docsrc/source/pages/support.rst index d0c5e8e88..46ed2e1e4 100644 --- a/docsrc/source/pages/support.rst +++ b/docsrc/source/pages/support.rst @@ -15,7 +15,7 @@ Conda install defaults to v1.4.1 Some users experience that ``conda install -c conda-forge pandas-profiling`` defaults to 1.4.1. -More details, `here `_, `here `__ and `here `__. +More details, `[22] `_, `[448] `__ and `[563] `__. If creating a new environment with a fresh installation does not resolve this issue, or you have good reason to persist with the current environment, then you could try installing a specific version e.g. ``conda install -c conda-forge pandas-profiling=2.10.0``. If it fails with an **UnsatisfiableError** that suggests dependent packages are either missing or incompatible, then further intervention is required to resolve the *environment* issue.
However, *conda* error messages in this regard may be too cryptic or insufficient to pinpoint the culprit, therefore you may have to resort to an alternate means of troubleshooting e.g using the `Mamba Package Manager `_. From 0a1cced620d1d80214ed4c4d3efc6b26d148bd52 Mon Sep 17 00:00:00 2001 From: Ian Eaves Date: Sun, 21 Feb 2021 10:21:13 -0600 Subject: [PATCH 04/40] modified summarizer (#704) feat: enable setting of typeset/summarizer refactor: summarizer extends handler --- src/pandas_profiling/model/handler.py | 149 ++++++++++++----------- src/pandas_profiling/model/summarizer.py | 25 ++-- src/pandas_profiling/profile_report.py | 17 ++- 3 files changed, 101 insertions(+), 90 deletions(-) diff --git a/src/pandas_profiling/model/handler.py b/src/pandas_profiling/model/handler.py index 7fae0ca5a..6fcf16843 100644 --- a/src/pandas_profiling/model/handler.py +++ b/src/pandas_profiling/model/handler.py @@ -1,69 +1,80 @@ -from functools import reduce -from typing import Type - -import networkx as nx -from visions import VisionsBaseType - -from pandas_profiling.model import typeset as ppt - - -def compose(functions): - """ - Compose a sequence of functions - :param functions: sequence of functions - :return: combined functions, e.g. 
[f(x), g(x)] -> g(f(x)) - """ - - def func(f, g): - def func2(*x): - res = g(*x) - if type(res) == bool: - return f(*x) - else: - return f(*res) - - return func2 - - return reduce(func, reversed(functions), lambda *x: x) - - -class Handler: - def __init__(self, mapping, typeset, *args, **kwargs): - self.mapping = mapping - self.typeset = typeset - - self._complete_dag() - - def _complete_dag(self): - for from_type, to_type in nx.topological_sort( - nx.line_graph(self.typeset.base_graph) - ): - self.mapping[to_type] = self.mapping[from_type] + self.mapping[to_type] - - def handle(self, dtype: Type[VisionsBaseType], *args, **kwargs) -> dict: - """ - - Returns: - object: - """ - op = compose(self.mapping.get(dtype, [])) - return op(*args) - - -def get_render_map(): - import pandas_profiling.report.structure.variables as render_algorithms - - render_map = { - ppt.Boolean: render_algorithms.render_boolean, - ppt.Numeric: render_algorithms.render_real, - ppt.Complex: render_algorithms.render_complex, - ppt.DateTime: render_algorithms.render_date, - ppt.Categorical: render_algorithms.render_categorical, - ppt.URL: render_algorithms.render_url, - ppt.Path: render_algorithms.render_path, - ppt.File: render_algorithms.render_file, - ppt.Image: render_algorithms.render_image, - ppt.Unsupported: render_algorithms.render_generic, - } - - return render_map +from functools import reduce +from typing import Callable, Dict, List, Type + +import networkx as nx +from visions import VisionsBaseType, VisionsTypeset + +from pandas_profiling.model import typeset as ppt + + +def compose(functions): + """ + Compose a sequence of functions + :param functions: sequence of functions + :return: combined functions, e.g. 
[f(x), g(x)] -> g(f(x)) + """ + + def func(f, g): + def func2(*x): + res = g(*x) + if type(res) == bool: + return f(*x) + else: + return f(*res) + + return func2 + + return reduce(func, reversed(functions), lambda *x: x) + + +class Handler: + """A generic handler + + Allows any custom mapping between data types and functions + """ + + def __init__( + self, + mapping: Dict[Type[VisionsBaseType], List[Callable]], + typeset: VisionsTypeset, + *args, + **kwargs + ): + self.mapping = mapping + self.typeset = typeset + + self._complete_dag() + + def _complete_dag(self): + for from_type, to_type in nx.topological_sort( + nx.line_graph(self.typeset.base_graph) + ): + self.mapping[to_type] = self.mapping[from_type] + self.mapping[to_type] + + def handle(self, dtype: Type[VisionsBaseType], *args, **kwargs) -> dict: + """ + + Returns: + object: + """ + op = compose(self.mapping.get(dtype, [])) + return op(*args) + + +def get_render_map(): + import pandas_profiling.report.structure.variables as render_algorithms + + render_map = { + ppt.Boolean: render_algorithms.render_boolean, + ppt.Numeric: render_algorithms.render_real, + ppt.Complex: render_algorithms.render_complex, + ppt.DateTime: render_algorithms.render_date, + ppt.Categorical: render_algorithms.render_categorical, + ppt.URL: render_algorithms.render_url, + ppt.Path: render_algorithms.render_path, + ppt.File: render_algorithms.render_file, + ppt.Image: render_algorithms.render_image, + ppt.Unsupported: render_algorithms.render_generic, + } + + return render_map diff --git a/src/pandas_profiling/model/summarizer.py b/src/pandas_profiling/model/summarizer.py index c839d6f2a..91b6b36b2 100644 --- a/src/pandas_profiling/model/summarizer.py +++ b/src/pandas_profiling/model/summarizer.py @@ -1,11 +1,10 @@ from typing import Type -import networkx as nx import numpy as np import pandas as pd from visions import VisionsBaseType -from pandas_profiling.model.handler import compose +from pandas_profiling.model.handler import 
Handler from pandas_profiling.model.summary_algorithms import ( describe_categorical_1d, describe_counts, @@ -31,20 +30,11 @@ ) -class BaseSummarizer: - def __init__(self, summary_map, typeset, *args, **kwargs): - self.summary_map = summary_map - self.typeset = typeset +class BaseSummarizer(Handler): + """A base summarizer - self._complete_summaries() - - def _complete_summaries(self): - for from_type, to_type in nx.topological_sort( - nx.line_graph(self.typeset.base_graph) - ): - self.summary_map[to_type] = ( - self.summary_map[from_type] + self.summary_map[to_type] - ) + Can be used to define custom summarizations + """ def summarize(self, series: pd.Series, dtype: Type[VisionsBaseType]) -> dict: """ @@ -52,12 +42,13 @@ def summarize(self, series: pd.Series, dtype: Type[VisionsBaseType]) -> dict: Returns: object: """ - summarizer_func = compose(self.summary_map.get(dtype, [])) - _, summary = summarizer_func(series, {"type": dtype}) + _, summary = self.handle(dtype, series, {"type": dtype}) return summary class PandasProfilingSummarizer(BaseSummarizer): + """The default Pandas Profiling summarizer""" + def __init__(self, typeset, *args, **kwargs): summary_map = { Unsupported: [ diff --git a/src/pandas_profiling/profile_report.py b/src/pandas_profiling/profile_report.py index fd15da139..b59e56f2a 100644 --- a/src/pandas_profiling/profile_report.py +++ b/src/pandas_profiling/profile_report.py @@ -7,13 +7,18 @@ import numpy as np import pandas as pd from tqdm.auto import tqdm +from visions import VisionsTypeset from pandas_profiling.config import config from pandas_profiling.expectations_report import ExpectationsReport from pandas_profiling.model.describe import describe as describe_df from pandas_profiling.model.messages import MessageType from pandas_profiling.model.sample import Sample -from pandas_profiling.model.summarizer import PandasProfilingSummarizer, format_summary +from pandas_profiling.model.summarizer import ( + BaseSummarizer, + 
PandasProfilingSummarizer, + format_summary, +) from pandas_profiling.model.typeset import ProfilingTypeSet from pandas_profiling.report import get_report_structure from pandas_profiling.report.presentation.flavours.html.templates import ( @@ -27,7 +32,7 @@ class ProfileReport(SerializeReport, ExpectationsReport): """Generate a profile report from a Dataset stored as a pandas `DataFrame`. - Used has is it will output its content as an HTML report in a Jupyter notebook. + Used as is, it will output its content as an HTML report in a Jupyter notebook. """ def __init__( @@ -41,6 +46,8 @@ def __init__( sample: Optional[dict] = None, config_file: Union[Path, str] = None, lazy: bool = True, + typeset: Optional[VisionsTypeset] = None, + summarizer: Optional[BaseSummarizer] = None, **kwargs, ): """Generate a ProfileReport based on a pandas DataFrame @@ -51,6 +58,8 @@ def __init__( config_file: a config file (.yml), mutually exclusive with `minimal` lazy: compute when needed sample: optional dict(name="Sample title", caption="Caption", data=pd.DataFrame()) + typeset: optional user typeset to use for type inference + summarizer: optional user summarizer to generate custom summary output **kwargs: other arguments, for valid arguments, check the default configuration file. """ config.clear() # to reset (previous) config. @@ -92,8 +101,8 @@ def __init__( self._html = None self._widgets = None self._json = None - self._typeset = None - self._summarizer = None + self._typeset = typeset + self._summarizer = summarizer if df is not None: # preprocess df From 7f2a61e1154faec9852a5c94d18d7e9feea201fa Mon Sep 17 00:00:00 2001 From: kurosch Date: Sat, 27 Feb 2021 12:56:37 +0100 Subject: [PATCH 05/40] Reverted the fastparquet dependency replacement of pyarrow The dependency to fastparquet is not required anymore since pyarrow is available in python3.8 now. 
--- requirements-test.txt | 2 +- tests/issues/test_issue147.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/requirements-test.txt b/requirements-test.txt index a5aff84da..89c4b5df1 100644 --- a/requirements-test.txt +++ b/requirements-test.txt @@ -4,7 +4,7 @@ codecov pytest-mypy pytest-cov nbval -fastparquet==0.5.0 +pyarrow flake8 check-manifest>=0.41 twine>=3.1.1 diff --git a/tests/issues/test_issue147.py b/tests/issues/test_issue147.py index 324ba99d0..8113b6262 100644 --- a/tests/issues/test_issue147.py +++ b/tests/issues/test_issue147.py @@ -13,7 +13,7 @@ def test_issue147(get_data_file): "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet", ) - df = pd.read_parquet(str(file_name), engine="fastparquet") + df = pd.read_parquet(str(file_name), engine="pyarrow") report = ProfileReport(df, title="PyArrow with Pandas Parquet Backend") html = report.to_html() assert type(html) == str From 08c43b4b83c0d8849931cf9e667d79a4fc4996dd Mon Sep 17 00:00:00 2001 From: Aarni Koskela Date: Mon, 1 Mar 2021 17:45:14 +0200 Subject: [PATCH 06/40] Move ipywidgets dependency to [notebook] extra (#708) --- docsrc/source/pages/changelog/v2_12_0.rst | 2 +- requirements.txt | 4 +--- setup.py | 6 +++++- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docsrc/source/pages/changelog/v2_12_0.rst b/docsrc/source/pages/changelog/v2_12_0.rst index b57917207..06b7ff826 100644 --- a/docsrc/source/pages/changelog/v2_12_0.rst +++ b/docsrc/source/pages/changelog/v2_12_0.rst @@ -27,4 +27,4 @@ Changelog v2.12.0 ⬆️ Dependencies ^^^^^^^^^^^^^^^^^^ -- \ No newline at end of file +- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default. 
\ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 34181cbc0..56149d813 100644 --- a/requirements.txt +++ b/requirements.txt @@ -20,6 +20,4 @@ tangled-up-in-unicode>=0.0.6 requests>=2.24.0 # Progress bar tqdm>=4.48.2 -# Jupyter notebook -ipywidgets>=7.5.1 -seaborn>=0.10.1 \ No newline at end of file +seaborn>=0.10.1 diff --git a/setup.py b/setup.py index 3a7f8f4f9..1b2cb5455 100644 --- a/setup.py +++ b/setup.py @@ -37,7 +37,11 @@ python_requires=">=3.6", install_requires=requirements, extras_require={ - "notebook": ["jupyter-client>=6.0.0", "jupyter-core>=4.6.3"], + "notebook": [ + "jupyter-client>=6.0.0", + "jupyter-core>=4.6.3", + "ipywidgets>=7.5.1", + ], }, package_data={ "pandas_profiling": ["py.typed"], From 446cd8d1766e7b7e097b73b0e9e98a6c632db5ba Mon Sep 17 00:00:00 2001 From: gverbock <32060943+gverbock@users.noreply.github.com> Date: Mon, 1 Mar 2021 16:46:46 +0100 Subject: [PATCH 07/40] feat: add number and prct of negative values (#696) * add number and prct of negative values Co-authored-by: Gilles Verbockhaven --- docsrc/source/pages/changelog/v2_12_0.rst | 60 +++++++++---------- .../model/summary_algorithms.py | 3 + .../report/structure/variables/render_real.py | 22 +++++-- tests/unit/test_describe.py | 6 ++ 4 files changed, 56 insertions(+), 35 deletions(-) diff --git a/docsrc/source/pages/changelog/v2_12_0.rst b/docsrc/source/pages/changelog/v2_12_0.rst index 06b7ff826..0314e8127 100644 --- a/docsrc/source/pages/changelog/v2_12_0.rst +++ b/docsrc/source/pages/changelog/v2_12_0.rst @@ -1,30 +1,30 @@ -Changelog v2.12.0 ------------------ - -🎉 Features -^^^^^^^^^^^ -- - -🐛 Bug fixes -^^^^^^^^^^^^ -- - -👷‍♂️ Internal Improvements -^^^^^^^^^^^^^^^^^^^^^^^^^^^ -- - -📖 Documentation -^^^^^^^^^^^^^^^^ -- - -⚠️ Deprecated -^^^^^^^^^^^^^^^^^ -- - -🚨 Breaking changes -^^^^^^^^^^^^^^^^^^^ -- - -⬆️ Dependencies -^^^^^^^^^^^^^^^^^^ -- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be 
installed alongside this package by default. \ No newline at end of file +Changelog v2.12.0 +----------------- + +🎉 Features +^^^^^^^^^^^ +- Add the number and the percentage of negative values for numerical variables `[695] `- (contributed by @gverbock). + +🐛 Bug fixes +^^^^^^^^^^^^ +- + +👷‍♂️ Internal Improvements +^^^^^^^^^^^^^^^^^^^^^^^^^^^ +- + +📖 Documentation +^^^^^^^^^^^^^^^^ +- + +⚠️ Deprecated +^^^^^^^^^^^^^^^^^ +- + +🚨 Breaking changes +^^^^^^^^^^^^^^^^^^^ +- + +⬆️ Dependencies +^^^^^^^^^^^^^^^^^^ +- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default. diff --git a/src/pandas_profiling/model/summary_algorithms.py b/src/pandas_profiling/model/summary_algorithms.py index 0f7d2c0f8..b5150228d 100644 --- a/src/pandas_profiling/model/summary_algorithms.py +++ b/src/pandas_profiling/model/summary_algorithms.py @@ -181,6 +181,9 @@ def describe_numeric_1d(series: pd.Series, summary: dict) -> Tuple[pd.Series, di value_counts = summary["value_counts_without_nan"] summary["n_zeros"] = 0 + negative_index = value_counts.index < 0 + summary["n_negative"] = value_counts.loc[negative_index].sum() + summary["p_negative"] = summary["n_negative"] / summary["n"] infinity_values = [np.inf, -np.inf] infinity_index = value_counts.index.isin(infinity_values) diff --git a/src/pandas_profiling/report/structure/variables/render_real.py b/src/pandas_profiling/report/structure/variables/render_real.py index c24777dc5..e7ce82412 100644 --- a/src/pandas_profiling/report/structure/variables/render_real.py +++ b/src/pandas_profiling/report/structure/variables/render_real.py @@ -67,17 +67,17 @@ def render_real(summary): "fmt": "fmt_percent", "alert": "p_infinite" in summary["warn_fields"], }, - ] - ) - - table2 = Table( - [ { "name": "Mean", "value": summary["mean"], "fmt": "fmt_numeric", "alert": False, }, + ] + ) + + table2 = Table( + [ { "name": "Minimum", "value": summary["min"], @@ -102,6 +102,18 
@@ def render_real(summary): "fmt": "fmt_percent", "alert": "p_zeros" in summary["warn_fields"], }, + { + "name": "Negative", + "value": summary["n_negative"], + "fmt": "fmt", + "alert": False, + }, + { + "name": "Negative (%)", + "value": summary["p_negative"], + "fmt": "fmt_percent", + "alert": False, + }, { "name": "Memory size", "value": summary["memory_size"], diff --git a/tests/unit/test_describe.py b/tests/unit/test_describe.py index 4fd2e87ae..1d6589df3 100644 --- a/tests/unit/test_describe.py +++ b/tests/unit/test_describe.py @@ -223,6 +223,8 @@ def expected_results(): "min": -10.0, "n_missing": 1, "p_missing": 0.11111111111111116, + "n_negative": 2, + "p_negative": 0.22222222222222222, "p_distinct": 6 / 8, "n": 9, "n_zeros": 2, @@ -253,6 +255,8 @@ def expected_results(): "min": -3.1415926535000001, "n_missing": 1, "p_missing": 0.11111111111111116, + "n_negative": 1, + "p_negative": 0.11111111111111116, "p_distinct": 1, "n_zeros": 0, "p_zeros": 0.0, @@ -307,6 +311,8 @@ def expected_results(): "min": 1.0, "n_missing": 0, "p_missing": 0.0, + "n_negative": 0, + "p_negative": 0.0, "n_infinite": 0, "n_distinct": 1, "p_distinct": 0.1111111111111111, From bdb956222a91d83b762445f4b8c29c4e21649945 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Mon, 1 Mar 2021 17:46:35 +0100 Subject: [PATCH 08/40] Version bump --- docsrc/source/pages/changelog.rst | 2 ++ docsrc/source/pages/changelog/v2_12_0.rst | 22 +++-------------- docsrc/source/pages/changelog/v2_13_0.rst | 30 +++++++++++++++++++++++ setup.py | 2 +- src/pandas_profiling/version.py | 2 +- 5 files changed, 38 insertions(+), 20 deletions(-) create mode 100644 docsrc/source/pages/changelog/v2_13_0.rst diff --git a/docsrc/source/pages/changelog.rst b/docsrc/source/pages/changelog.rst index 0eca44ba5..4ed3f9145 100644 --- a/docsrc/source/pages/changelog.rst +++ b/docsrc/source/pages/changelog.rst @@ -2,6 +2,8 @@ Changelog ========= +.. include:: changelog/v2_12_0.rst + .. include:: changelog/v2_11_0.rst .. 
include:: changelog/v2_10_1.rst diff --git a/docsrc/source/pages/changelog/v2_12_0.rst b/docsrc/source/pages/changelog/v2_12_0.rst index 0314e8127..2b6b6a5e8 100644 --- a/docsrc/source/pages/changelog/v2_12_0.rst +++ b/docsrc/source/pages/changelog/v2_12_0.rst @@ -4,27 +4,13 @@ Changelog v2.12.0 🎉 Features ^^^^^^^^^^^ - Add the number and the percentage of negative values for numerical variables `[695] `- (contributed by @gverbock). - -🐛 Bug fixes -^^^^^^^^^^^^ -- - -👷‍♂️ Internal Improvements -^^^^^^^^^^^^^^^^^^^^^^^^^^^ -- +- Enable setting of typeset/summarizer (contributed by @ieaves) 📖 Documentation ^^^^^^^^^^^^^^^^ -- - -⚠️ Deprecated -^^^^^^^^^^^^^^^^^ -- - -🚨 Breaking changes -^^^^^^^^^^^^^^^^^^^ -- +- Fix link syntax (contributed by @ChrisCarini) ⬆️ Dependencies ^^^^^^^^^^^^^^^^^^ -- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default. +- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx). +- Replaced the (testing only) `fastparquet` dependency with `pyarrow` (default pandas parquet engine, contributed by @kurosch). 
\ No newline at end of file diff --git a/docsrc/source/pages/changelog/v2_13_0.rst b/docsrc/source/pages/changelog/v2_13_0.rst new file mode 100644 index 000000000..f2d44616a --- /dev/null +++ b/docsrc/source/pages/changelog/v2_13_0.rst @@ -0,0 +1,30 @@ +Changelog vx.y.z +---------------- + +🎉 Features +^^^^^^^^^^^ +- + +🐛 Bug fixes +^^^^^^^^^^^^ +- + +👷‍♂️ Internal Improvements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +- + +📖 Documentation +^^^^^^^^^^^^^^^^ +- + +⚠️ Deprecated +^^^^^^^^^^^^^^^^^ +- + +🚨 Breaking changes +^^^^^^^^^^^^^^^^^^^ +- + +⬆️ Dependencies +^^^^^^^^^^^^^^^^^^ +- \ No newline at end of file diff --git a/setup.py b/setup.py index 1b2cb5455..e3e0df8e2 100644 --- a/setup.py +++ b/setup.py @@ -11,7 +11,7 @@ with (source_root / "requirements.txt").open(encoding="utf8") as f: requirements = f.readlines() -version = "2.11.0" +version = "2.12.0" with (source_root / "src" / "pandas_profiling" / "version.py").open( "w", encoding="utf-8" diff --git a/src/pandas_profiling/version.py b/src/pandas_profiling/version.py index 1ac806f2d..08d6674fd 100644 --- a/src/pandas_profiling/version.py +++ b/src/pandas_profiling/version.py @@ -1,2 +1,2 @@ """This file is auto-generated by setup.py, please do not alter.""" -__version__ = "2.11.0" +__version__ = "2.12.0" From 3b2cd4cb1f77fd69620dde968d1cf42b49598933 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 1 Mar 2021 16:47:57 +0000 Subject: [PATCH 09/40] [pre-commit.ci] pre-commit autoupdate --- .pre-commit-config.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 52a42dd72..20305ac6d 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -5,7 +5,7 @@ repos: - id: black language_version: python3.8 - repo: https://github.com/nbQA-dev/nbQA - rev: 0.5.7 + rev: 0.5.9 hooks: - id: nbqa-black additional_dependencies: [ black==20.8b1 ] From 
9f7c2d08cca95e9350c2fcea7c405869707896af Mon Sep 17 00:00:00 2001 From: Aarni Koskela Date: Tue, 2 Mar 2021 19:27:32 +0200 Subject: [PATCH 10/40] build: Upgrade phik This drops the hard dependency on numba. Refs #708 Refs KaveIO/PhiK#17 --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 56149d813..e290cb734 100644 --- a/requirements.txt +++ b/requirements.txt @@ -13,7 +13,7 @@ htmlmin>=0.1.12 # Missing values missingno>=0.4.2 # Correlations -phik>=0.10.0 +phik>=0.11.1 # Text analysis tangled-up-in-unicode>=0.0.6 # Examples From 159f858879dada9f53bb466c8ed1f580b70e3b63 Mon Sep 17 00:00:00 2001 From: Ian Eaves Date: Fri, 5 Mar 2021 00:48:45 -0600 Subject: [PATCH 11/40] docs update --- README.md | 5 +++-- docsrc/source/pages/resources.rst | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index b1838fe88..5aea21b6f 100644 --- a/README.md +++ b/README.md @@ -306,8 +306,9 @@ Types are a powerful abstraction for effective data analysis, that goes beyond t `pandas-profiling` currently, recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_. We have developed a type system for Python, tailored for data analysis: [visions](https://github.com/dylan-profiler/visions). -Selecting the right typeset drastically reduces the complexity the code of your analysis. -Future versions of `pandas-profiling` will have extended type support through `visions`! +Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. +To learn more about `pandas-profiling`'s type system, check out the default implementation [here](https://github.com/pandas-profiling/pandas-profiling/blob/develop/src/pandas_profiling/model/typeset.py). 
+In the meantime, user customized summarizations and type definitions are now fully supported - if you have a specific use-case please reach out with ideas or a PR! ## Contributing diff --git a/docsrc/source/pages/resources.rst b/docsrc/source/pages/resources.rst index 16096af8d..81f85fff1 100644 --- a/docsrc/source/pages/resources.rst +++ b/docsrc/source/pages/resources.rst @@ -14,7 +14,7 @@ Notebooks Articles -------- - +- `Bringing Customization to Pandas Profiling `_ (Ian Eaves, March 5, 2021) - `Beginner Friendly Data Science Projects Accepting Contributions `_ (Adam Ross Nelson, January 18, 2021) - `Pandas profiling and exploratory data analysis with line one of code! `_ (Magdalena Konkiewicz, Jun 10, 2020) - `The Covid 19 health issue `_ (Concillier Kitungulu, April 20, 2020) From fe3344e5395ce97d69ab16cb47c41153c4e1eef8 Mon Sep 17 00:00:00 2001 From: Jimmy Stammers Date: Sun, 14 Mar 2021 14:17:30 +0000 Subject: [PATCH 12/40] fix: Add parse_strings_as_datetimes to datetime_expectations (#728) * patched args for datetime_expectations --- src/pandas_profiling/model/expectation_algorithms.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/pandas_profiling/model/expectation_algorithms.py b/src/pandas_profiling/model/expectation_algorithms.py index 83e748a36..efac257b5 100644 --- a/src/pandas_profiling/model/expectation_algorithms.py +++ b/src/pandas_profiling/model/expectation_algorithms.py @@ -69,7 +69,10 @@ def path_expectations(name, summary, batch, *args): def datetime_expectations(name, summary, batch, *args): if any(k in summary for k in ["min", "max"]): batch.expect_column_values_to_be_between( - name, min_value=summary.get("min"), max_value=summary.get("max") + name, + min_value=summary.get("min"), + max_value=summary.get("max"), + parse_strings_as_datetimes=True, ) return name, summary, batch From 1c905535eeb32ea952847a44b3eb9eb557365c8e Mon Sep 17 00:00:00 2001 From: Owen Lamont Date: Fri, 26 Mar 2021 01:21:52 +0800 Subject: 
[PATCH 13/40] Removed dead link to dark theme config file (#735) --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5aea21b6f..e1e5351b4 100644 --- a/README.md +++ b/README.md @@ -239,7 +239,7 @@ A set of options is available in order to adapt the report generated. * `progress_bar` (`bool`): If True, `pandas-profiling` will display a progress bar. * `infer_dtypes` (`bool`): When `True` (default) the `dtype` of variables are inferred using `visions` using the typeset logic (for instance a column that has integers stored as string will be analyzed as if being numeric). -More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml), [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml) and [dark themed configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml). +More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml) and [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml). 
You find the configuration docs on the advanced usage page [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html) From 563f52383e21a93ad3df9c23e0dbe57938b8c9b0 Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Thu, 25 Mar 2021 20:24:17 +0100 Subject: [PATCH 14/40] fix(tests): parse_strings_as_datetimes=True --- tests/unit/test_ge_integration_expectations.py | 1 + 1 file changed, 1 insertion(+) diff --git a/tests/unit/test_ge_integration_expectations.py b/tests/unit/test_ge_integration_expectations.py index 40f3850ca..443f2f1aa 100644 --- a/tests/unit/test_ge_integration_expectations.py +++ b/tests/unit/test_ge_integration_expectations.py @@ -123,6 +123,7 @@ def test_datetime_expectations(batch): "column", min_value=0, max_value=100, + parse_strings_as_datetimes=True, ) From 82444da1c47732c5aacf0a1cb30e0dfa1b8ec840 Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Fri, 26 Mar 2021 00:29:31 +0100 Subject: [PATCH 15/40] fix(fmt): Formatter exp (#736) Update formatters.py extend tests with exponent formatting --- src/pandas_profiling/report/formatters.py | 2 ++ tests/unit/test_formatters.py | 3 +++ 2 files changed, 5 insertions(+) diff --git a/src/pandas_profiling/report/formatters.py b/src/pandas_profiling/report/formatters.py index 6a29e8bdd..a4fe77ae0 100644 --- a/src/pandas_profiling/report/formatters.py +++ b/src/pandas_profiling/report/formatters.py @@ -206,8 +206,10 @@ def fmt_numeric(value: float, precision=10) -> str: fmtted = f"{{:.{precision}g}}".format(value) for v in ["e+", "e-"]: if v in fmtted: + sign = "-" if v in "e-" else "" fmtted = fmtted.replace(v, " × 10") + "" fmtted = fmtted.replace("0", "") + fmtted = fmtted.replace("", f"{sign}") return fmtted diff --git a/tests/unit/test_formatters.py b/tests/unit/test_formatters.py index 09711cd47..2e103e53d 100644 --- a/tests/unit/test_formatters.py +++ b/tests/unit/test_formatters.py @@ -79,6 +79,9 @@ def test_fmt_array(array, threshold, expected): 
(81.000000, 10, "81"), (81, 10, "81"), (81.999861123123123123, 10, "81.99986112"), + (1e20, 10, "1 × 10<sup>20</sup>"), + (1e-20, 10, "1 × 10<sup>-20</sup>"), + (1e8, 3, "1 × 10<sup>8</sup>"), ], ) def test_fmt_numeric(value, precision, expected): From 29ff64b4ef5c724404c743967512101c820ec014 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 22 Mar 2021 16:56:35 +0000 Subject: [PATCH 16/40] [pre-commit.ci] pre-commit autoupdate --- .pre-commit-config.yaml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 20305ac6d..e41c68345 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -17,12 +17,12 @@ repos: additional_dependencies: [ pyupgrade==2.7.3 ] args: [ --nbqa-mutate, --py36-plus ] - repo: https://github.com/asottile/pyupgrade - rev: v2.10.0 + rev: v2.11.0 hooks: - id: pyupgrade args: ['--py36-plus','--exit-zero-even-if-changed'] - repo: https://github.com/pycqa/isort - rev: 5.7.0 + rev: 5.8.0 hooks: - id: isort files: '.*' @@ -32,7 +32,7 @@ repos: hooks: - id: check-manifest - repo: https://gitlab.com/pycqa/flake8 - rev: "3.8.4" + rev: "3.9.0" hooks: - id: flake8 args: [ "--select=E9,F63,F7,F82"] #,T001 From 44a747670b63aca40e4e7e2b5a023539ca1e58b1 Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Sat, 27 Mar 2021 20:23:16 +0100 Subject: [PATCH 17/40] Slack links (#738) * Update README.md * Update contribution_guidelines.rst * Update support.rst --- README.md | 4 ++-- docsrc/source/pages/contribution_guidelines.rst | 6 +++++- docsrc/source/pages/support.rst | 4 ++++ 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index e1e5351b4..5dda09329 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@

Documentation | - <a href="https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ">Slack</a> + <a href="https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA">Slack</a> | Stack Overflow

@@ -314,7 +314,7 @@ In the meantime, user customized summarizations and type definitions are now ful Read on getting involved in the [Contribution Guide](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/contribution_guidelines.html). -A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. [Join the Slack community](https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ). +A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. [Join the Slack community](https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA). ## Editor integration diff --git a/docsrc/source/pages/contribution_guidelines.rst b/docsrc/source/pages/contribution_guidelines.rst index e0d1b4bc3..2d0b80b42 100644 --- a/docsrc/source/pages/contribution_guidelines.rst +++ b/docsrc/source/pages/contribution_guidelines.rst @@ -9,6 +9,10 @@ Contributing a new feature * Ensure the PR description clearly describes the problem and solution. Include the relevant issue number if applicable. + +Slack community +--------------- +A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. `Join the Slack community <https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA>`_. Developer tools --------------- @@ -61,4 +65,4 @@ Read Github's `open source legal guide `_ on Github. \ No newline at end of file +Read more on getting involved in the `Contribution Guide `_ on Github.
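The `fmt_numeric` change in patch 15 above adds a sign to exponents rendered as HTML superscripts. As a rough, hypothetical sketch of the intended behaviour (not the library's exact code — the real implementation rewrites `<sup>` markers in place rather than splitting the string):

```python
def fmt_numeric(value: float, precision: int = 10) -> str:
    """Format a number, rendering any exponent as a signed HTML superscript.

    Simplified illustration of patch 15's target behaviour.
    """
    fmtted = f"{{:.{precision}g}}".format(value)
    for sep in ("e+", "e-"):
        if sep in fmtted:
            sign = "-" if sep == "e-" else ""
            mantissa, exponent = fmtted.split(sep)
            # Drop leading zeros in the exponent: "1e+08" -> "1 × 10<sup>8</sup>"
            fmtted = f"{mantissa} × 10<sup>{sign}{exponent.lstrip('0')}</sup>"
    return fmtted


print(fmt_numeric(81.999861123123123123))  # 81.99986112
print(fmt_numeric(1e-20))                  # 1 × 10<sup>-20</sup>
print(fmt_numeric(1e8, precision=3))       # 1 × 10<sup>8</sup>
```

The `precision` argument maps onto Python's `g` format presentation type, which is what produces the `1e+08`-style intermediate string that the loop then rewrites.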
diff --git a/docsrc/source/pages/support.rst b/docsrc/source/pages/support.rst index 46ed2e1e4..3f35ac3bd 100644 --- a/docsrc/source/pages/support.rst +++ b/docsrc/source/pages/support.rst @@ -35,6 +35,10 @@ Users with a request for help on how to use `pandas-profiling` should consider a :alt: Questions: Stackoverflow "pandas-profiling" :target: https://stackoverflow.com/questions/tagged/pandas-profiling +Slack community +--------------- + +`Join the Slack community <https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA>`_ and come into contact with other users and developers, that might be able to answer your questions. Reporting a bug --------------- From 7b796dccf132f06a807017148fc02e3cdf771269 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Sat, 27 Mar 2021 21:25:03 +0100 Subject: [PATCH 18/40] Fix #725 --- src/pandas_profiling/config_default.yaml | 1 + src/pandas_profiling/config_minimal.yaml | 1 + src/pandas_profiling/model/duplicates.py | 11 +++++++-- tests/issues/test_issue725.py | 31 ++++++++++++++++++++++++ 4 files changed, 42 insertions(+), 2 deletions(-) create mode 100644 tests/issues/test_issue725.py diff --git a/src/pandas_profiling/config_default.yaml b/src/pandas_profiling/config_default.yaml index 42c1b1368..ad40d6039 100644 --- a/src/pandas_profiling/config_default.yaml +++ b/src/pandas_profiling/config_default.yaml @@ -150,6 +150,7 @@ memory_deep: False # Configuration related to the duplicates duplicates: head: 10 + key: "# duplicates" # Configuration related to the samples area samples: diff --git a/src/pandas_profiling/config_minimal.yaml b/src/pandas_profiling/config_minimal.yaml index e1aacad3d..1147fd680 100644 --- a/src/pandas_profiling/config_minimal.yaml +++ b/src/pandas_profiling/config_minimal.yaml @@ -151,6 +151,7 @@ memory_deep: False # Configuration related to the duplicates duplicates: head: 0 + key: "# duplicates" # Configuration related to the samples area samples: diff --git a/src/pandas_profiling/model/duplicates.py b/src/pandas_profiling/model/duplicates.py index
b81fc232b..72ff83a94 100644 --- a/src/pandas_profiling/model/duplicates.py +++ b/src/pandas_profiling/model/duplicates.py @@ -18,11 +18,18 @@ def get_duplicates(df: pd.DataFrame, supported_columns) -> Optional[pd.DataFrame n_head = config["duplicates"]["head"].get(int) if n_head > 0 and supported_columns: + duplicates_key = config["duplicates"]["key"].get(str) + if duplicates_key in df.columns: + raise ValueError( + f"Duplicates key ({duplicates_key}) may not be part of the DataFrame. Either change the " + f" column name in the DataFrame or change the 'duplicates.key' parameter." + ) + return ( df[df.duplicated(subset=supported_columns, keep=False)] .groupby(supported_columns) .size() - .reset_index(name="count") - .nlargest(n_head, "count") + .reset_index(name=duplicates_key) + .nlargest(n_head, duplicates_key) ) return None diff --git a/tests/issues/test_issue725.py b/tests/issues/test_issue725.py new file mode 100644 index 000000000..c46e37d67 --- /dev/null +++ b/tests/issues/test_issue725.py @@ -0,0 +1,31 @@ +""" +Test for issue 725: +https://github.com/pandas-profiling/pandas-profiling/issues/725 +""" +import numpy as np +import pandas as pd +import pytest + +from pandas_profiling.model.duplicates import get_duplicates + + +@pytest.fixture(scope="module") +def test_data(): + np.random.seed(5) + df = pd.DataFrame( + np.random.randint(1, 100, (100, 5)), + columns=["a", "b", "c", "duplicates", "count"], + ) + df = pd.concat([df, df], axis=0) + return df + + +def test_issue725(test_data): + duplicates = get_duplicates(test_data, list(test_data.columns)) + assert set(duplicates.columns) == set(test_data.columns).union({"# duplicates"}) + + +def test_issue725_existing(test_data): + test_data = test_data.rename(columns={"count": "# duplicates"}) + with pytest.raises(ValueError): + _ = get_duplicates(test_data, list(test_data.columns)) From 821d2065bb40fdbd422582277d5718ec849209a7 Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Mon, 29 Mar 2021 00:03:12 +0200 
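Patch 18 above makes the duplicate-count column name configurable (`duplicates.key`) and raises if it collides with an existing column. A standalone sketch of the same grouping logic, with the config values (`head`, `key`) inlined as parameters for illustration:

```python
import pandas as pd


def get_duplicates(df: pd.DataFrame, columns, n_head: int = 10,
                   duplicates_key: str = "# duplicates") -> pd.DataFrame:
    """Summarize the most frequent duplicated rows (simplified sketch)."""
    if duplicates_key in df.columns:
        raise ValueError(
            f"Duplicates key ({duplicates_key}) may not be part of the DataFrame."
        )
    return (
        # keep=False marks every copy of a duplicated row, not just the extras
        df[df.duplicated(subset=columns, keep=False)]
        .groupby(columns)
        .size()
        .reset_index(name=duplicates_key)
        .nlargest(n_head, duplicates_key)
    )


df = pd.DataFrame({"a": [1, 1, 2, 3, 3, 3], "b": ["x", "x", "y", "z", "z", "z"]})
print(get_duplicates(df, ["a", "b"]))
```

This mirrors why the test above expects the result columns to be the grouped columns plus `# duplicates`, and why renaming a column to `# duplicates` triggers the `ValueError`.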
Subject: [PATCH 19/40] feat: Allow empty dataframe (#740) Fixes #678: allow empty dataframe Co-authored-by: fwd2020-c <74557237+fwd2020-c@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- docsrc/source/pages/changelog/v2_13_0.rst | 6 ++-- src/pandas_profiling/model/correlations.py | 3 ++ src/pandas_profiling/model/describe.py | 3 -- src/pandas_profiling/model/messages.py | 11 +++++++ src/pandas_profiling/model/sample.py | 9 ++++-- src/pandas_profiling/model/summary.py | 11 +++++-- .../model/summary_algorithms.py | 4 +-- .../warnings/warning_duplicates.html | 2 +- .../templates/warnings/warning_empty.html | 1 + .../presentation/flavours/widget/warnings.py | 1 + tests/unit/test_decorator.py | 9 ------ tests/unit/test_describe.py | 6 ---- tests/unit/test_example.py | 13 +++++++++ tests/unit/test_summary.py | 12 ++++++++ tests/unit/test_summary_algos.py | 29 ++++++++++++++++++- 15 files changed, 90 insertions(+), 30 deletions(-) create mode 100644 src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_empty.html create mode 100644 tests/unit/test_summary.py diff --git a/docsrc/source/pages/changelog/v2_13_0.rst b/docsrc/source/pages/changelog/v2_13_0.rst index f2d44616a..4dbf7b73a 100644 --- a/docsrc/source/pages/changelog/v2_13_0.rst +++ b/docsrc/source/pages/changelog/v2_13_0.rst @@ -1,9 +1,9 @@ -Changelog vx.y.z ----------------- +Changelog v2.13.0 +----------------- 🎉 Features ^^^^^^^^^^^ -- +- Allow empty data frames `[678] `_ "contributed by @spbail, @fwd2020-c" 🐛 Bug fixes ^^^^^^^^^^^^ diff --git a/src/pandas_profiling/model/correlations.py b/src/pandas_profiling/model/correlations.py index 99756ed71..ee64edc0d 100644 --- a/src/pandas_profiling/model/correlations.py +++ b/src/pandas_profiling/model/correlations.py @@ -155,6 +155,9 @@ def calculate_correlation( The correlation matrices for the given correlation measures. Return None if correlation is empty. 
""" + if len(df) == 0: + return None + correlation_measures = { "pearson": Pearson, "spearman": Spearman, diff --git a/src/pandas_profiling/model/describe.py b/src/pandas_profiling/model/describe.py index 80b53d253..a9deb0d48 100644 --- a/src/pandas_profiling/model/describe.py +++ b/src/pandas_profiling/model/describe.py @@ -47,9 +47,6 @@ def describe( if not isinstance(df, pd.DataFrame): warnings.warn("df is not of type pandas.DataFrame") - if df.empty: - raise ValueError("df can not be empty") - disable_progress_bar = not config["progress_bar"].get(bool) date_start = datetime.utcnow() diff --git a/src/pandas_profiling/model/messages.py b/src/pandas_profiling/model/messages.py index 3330c049e..7f41dc865 100644 --- a/src/pandas_profiling/model/messages.py +++ b/src/pandas_profiling/model/messages.py @@ -56,6 +56,9 @@ class MessageType(Enum): UNIFORM = auto() """The variable is uniformly distributed""" + EMPTY = auto() + """The DataFrame is empty""" + class Message: """A message object (type, values, column).""" @@ -117,6 +120,14 @@ def check_table_messages(table: dict) -> List[Message]: fields={"n_duplicates"}, ) ) + if table["n"] == 0: + messages.append( + Message( + message_type=MessageType.EMPTY, + values=table, + fields={"n"}, + ) + ) return messages diff --git a/src/pandas_profiling/model/sample.py b/src/pandas_profiling/model/sample.py index 1df2acc78..50fac9397 100644 --- a/src/pandas_profiling/model/sample.py +++ b/src/pandas_profiling/model/sample.py @@ -1,3 +1,5 @@ +from typing import List + import attr import pandas as pd @@ -12,7 +14,7 @@ class Sample: caption = attr.ib(default=None) -def get_sample(df: pd.DataFrame) -> list: +def get_sample(df: pd.DataFrame) -> List[Sample]: """Obtains a sample from head and tail of the DataFrame Args: @@ -21,7 +23,10 @@ def get_sample(df: pd.DataFrame) -> list: Returns: a list of Sample objects """ - samples = [] + samples: List[Sample] = [] + if len(df) == 0: + return samples + n_head = 
config["samples"]["head"].get(int) if n_head > 0: samples.append(Sample("head", df.head(n=n_head), "First rows")) diff --git a/src/pandas_profiling/model/summary.py b/src/pandas_profiling/model/summary.py index a579275c1..b5c2191a2 100644 --- a/src/pandas_profiling/model/summary.py +++ b/src/pandas_profiling/model/summary.py @@ -124,7 +124,7 @@ def get_table_stats(df: pd.DataFrame, variable_stats: dict) -> dict: n = len(df) memory_size = df.memory_usage(deep=config["memory_deep"].get(bool)).sum() - record_size = float(memory_size) / n + record_size = float(memory_size) / n if n > 0 else 0 table_stats = { "n": n, @@ -143,8 +143,10 @@ def get_table_stats(df: pd.DataFrame, variable_stats: dict) -> dict: if series_summary["n_missing"] == n: table_stats["n_vars_all_missing"] += 1 - table_stats["p_cells_missing"] = table_stats["n_cells_missing"] / ( - table_stats["n"] * table_stats["n_var"] + table_stats["p_cells_missing"] = ( + table_stats["n_cells_missing"] / (table_stats["n"] * table_stats["n_var"]) + if table_stats["n"] > 0 + else 0 ) supported_columns = [ @@ -203,6 +205,9 @@ def get_missing_diagrams(df: pd.DataFrame, table_stats: dict) -> dict: A dictionary containing the base64 encoded plots for each diagram that is active in the config (matrix, bar, heatmap, dendrogram). """ + if len(df) == 0: + return {} + def warn_missing(missing_name, error): warnings.warn( f"""There was an attempt to generate the {missing_name} missing values diagrams, but this failed. 
diff --git a/src/pandas_profiling/model/summary_algorithms.py b/src/pandas_profiling/model/summary_algorithms.py index b5150228d..768829736 100644 --- a/src/pandas_profiling/model/summary_algorithms.py +++ b/src/pandas_profiling/model/summary_algorithms.py @@ -96,7 +96,7 @@ def describe_supported( stats = { "n_distinct": distinct_count, "p_distinct": distinct_count / count if count > 0 else 0, - "is_unique": unique_count == count, + "is_unique": unique_count == count and count > 0, "n_unique": unique_count, "p_unique": unique_count / count if count > 0 else 0, } @@ -120,7 +120,7 @@ def describe_generic(series: pd.Series, summary: dict) -> Tuple[pd.Series, dict] summary.update( { "n": length, - "p_missing": summary["n_missing"] / length, + "p_missing": summary["n_missing"] / length if length > 0 else 0, "count": length - summary["n_missing"], "memory_size": series.memory_usage(deep=config["memory_deep"].get(bool)), } diff --git a/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_duplicates.html b/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_duplicates.html index 8820e1d50..59bb93c56 100644 --- a/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_duplicates.html +++ b/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_duplicates.html @@ -1 +1 @@ -Dataset has {{ message.values['n_duplicates'] }} ({{ message.values['p_duplicates'] | fmt_percent }}) duplicate rows \ No newline at end of file +Dataset has {{ message.values['n_duplicates'] }} ({{ message.values['p_duplicates'] | fmt_percent }}) duplicate rows \ No newline at end of file diff --git a/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_empty.html b/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_empty.html new file mode 100644 index 000000000..a676c9577 --- /dev/null +++ 
b/src/pandas_profiling/report/presentation/flavours/html/templates/warnings/warning_empty.html @@ -0,0 +1 @@ +Dataset is empty diff --git a/src/pandas_profiling/report/presentation/flavours/widget/warnings.py b/src/pandas_profiling/report/presentation/flavours/widget/warnings.py index 97f830014..43b959f79 100644 --- a/src/pandas_profiling/report/presentation/flavours/widget/warnings.py +++ b/src/pandas_profiling/report/presentation/flavours/widget/warnings.py @@ -25,6 +25,7 @@ def render(self): "skewed": "info", "high_correlation": "", "duplicates": "", + "empty": "", } items = [] diff --git a/tests/unit/test_decorator.py b/tests/unit/test_decorator.py index 25d0a58bc..57f3dc5e0 100644 --- a/tests/unit/test_decorator.py +++ b/tests/unit/test_decorator.py @@ -1,5 +1,4 @@ import pandas as pd -import pytest import pandas_profiling @@ -16,11 +15,3 @@ def test_decorator(get_data_file): missing_diagrams={"heatmap": False, "dendrogram": False}, ) assert "Coursera Test Report" in report.to_html(), "Title is not found" - - -def test_empty_decorator(): - df = pd.DataFrame().profile_report(progress_bar=False) - with pytest.raises(ValueError) as e: - df.get_description() - - assert e.value.args[0] == "df can not be empty" diff --git a/tests/unit/test_describe.py b/tests/unit/test_describe.py index 1d6589df3..cbec917f8 100644 --- a/tests/unit/test_describe.py +++ b/tests/unit/test_describe.py @@ -571,12 +571,6 @@ def test_describe_df(column, describe_data, expected_results, summarizer, typese ), f"Histogram missing for column {column}" -def test_describe_empty(summarizer, typeset): - empty_frame = pd.DataFrame() - with pytest.raises(ValueError): - describe("", empty_frame, summarizer, typeset) - - def test_describe_list(summarizer, typeset): with pytest.raises(AttributeError): with pytest.warns(UserWarning): diff --git a/tests/unit/test_example.py b/tests/unit/test_example.py index cbb72a6ee..8b1487543 100644 --- a/tests/unit/test_example.py +++ b/tests/unit/test_example.py @@ 
-50,3 +50,16 @@ def test_example(get_data_file, test_output_dir): and len(profile.get_description().items()) == 10 ), "Unexpected result" assert "12" in profile.to_html() + + +def test_example_empty(): + df = pd.DataFrame({"A": [], "B": []}) + profile = ProfileReport(df) + description = profile.get_description() + + assert len(description["correlations"]) == 0 + assert len(description["missing"]) == 0 + assert len(description["sample"]) == 0 + + html = profile.to_html() + assert "Dataset is empty" in html diff --git a/tests/unit/test_summary.py b/tests/unit/test_summary.py new file mode 100644 index 000000000..bf25b3cf8 --- /dev/null +++ b/tests/unit/test_summary.py @@ -0,0 +1,12 @@ +import pandas as pd + +from pandas_profiling.model.summary import get_table_stats + + +def test_get_table_stats_empty_df(): + df = pd.DataFrame({"A": [], "B": []}) + table_stats = get_table_stats(df, {}) + assert table_stats["n"] == 0 + assert table_stats["p_cells_missing"] == 0 + assert table_stats["n_duplicates"] == 0 + assert table_stats["p_duplicates"] == 0 diff --git a/tests/unit/test_summary_algos.py b/tests/unit/test_summary_algos.py index 98460bec6..ec5846670 100644 --- a/tests/unit/test_summary_algos.py +++ b/tests/unit/test_summary_algos.py @@ -1,7 +1,12 @@ import numpy as np import pandas as pd +import pytest -from pandas_profiling.model.summary_algorithms import describe_counts +from pandas_profiling.model.summary_algorithms import ( + describe_counts, + describe_generic, + describe_supported, +) def test_count_summary_sorted(): @@ -24,3 +29,25 @@ def test_count_summary_category(): ) sn, r = describe_counts(s, {}) assert len(r["value_counts_without_nan"].index) == 2 + + +@pytest.fixture(scope="class") +def empty_data() -> pd.DataFrame: + return pd.DataFrame({"A": []}) + + +def test_summary_supported_empty_df(empty_data): + series, summary = describe_counts(empty_data["A"], {}) + assert summary["n_missing"] == 0 + assert "p_missing" not in summary + + series, summary = 
describe_generic(series, summary) + assert summary["n_missing"] == 0 + assert summary["p_missing"] == 0 + assert summary["count"] == 0 + + _, summary = describe_supported(series, summary) + assert summary["n_distinct"] == 0 + assert summary["p_distinct"] == 0 + assert summary["n_unique"] == 0 + assert not summary["is_unique"] From cade4802c709f32813da8ed2d050d2d98bc6818a Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 5 Apr 2021 16:59:53 +0000 Subject: [PATCH 20/40] [pre-commit.ci] pre-commit autoupdate --- .pre-commit-config.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index e41c68345..ff93d66d7 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -5,7 +5,7 @@ repos: - id: black language_version: python3.8 - repo: https://github.com/nbQA-dev/nbQA - rev: 0.5.9 + rev: 0.6.0 hooks: - id: nbqa-black additional_dependencies: [ black==20.8b1 ] @@ -31,7 +31,7 @@ repos: rev: "0.46" hooks: - id: check-manifest -- repo: https://gitlab.com/pycqa/flake8 +- repo: https://github.com/PyCQA/flake8 rev: "3.9.0" hooks: - id: flake8 From e9796657dfe126c8cece2363f86a2becd56f9a60 Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Wed, 7 Apr 2021 18:43:15 +0200 Subject: [PATCH 21/40] ci(benchmark): Benchmark introduction (#753) --- .github/workflows/benchmark.yml | 37 ++++++++++ requirements-test.txt | 1 + tests/benchmarks/bench.py | 74 +++++++++++++++++++ tests/performance/time_inf.py | 25 ------- tests/performance/time_kurtosis.py | 36 --------- tests/performance/time_mad.py | 56 -------------- tests/performance/time_mean.py | 36 --------- tests/performance/timings.py | 113 ----------------------------- 8 files changed, 112 insertions(+), 266 deletions(-) create mode 100644 .github/workflows/benchmark.yml create mode 100644 tests/benchmarks/bench.py delete mode 100644 tests/performance/time_inf.py delete mode 100644 
tests/performance/time_kurtosis.py delete mode 100644 tests/performance/time_mad.py delete mode 100644 tests/performance/time_mean.py delete mode 100644 tests/performance/timings.py diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml new file mode 100644 index 000000000..24c1bf373 --- /dev/null +++ b/.github/workflows/benchmark.yml @@ -0,0 +1,37 @@ +name: Performance Benchmarks + +on: + push: + branches: + - master + - develop + +jobs: + benchmark: + name: ${{ matrix.os }} x ${{ matrix.python }} + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false + matrix: + os: [ ubuntu-latest ] #, macos-latest, windows-latest ] + python: ['3.8'] + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + - uses: actions/setup-python@v1 + with: + python-version: ${{ matrix.python }} + - name: Run benchmark + run: | + pip install -r requirements.txt + pip install -r requirements-test.txt + pytest tests/benchmarks/bench.py --benchmark-json benchmark.json + - name: Store benchmark result + uses: rhysd/github-action-benchmark@v1 + with: + name: Pandas Profiling Benchmarks + tool: 'pytest' + output-file-path: benchmark.json + github-token: ${{ secrets.GITHUB_TOKEN }} + auto-push: true \ No newline at end of file diff --git a/requirements-test.txt b/requirements-test.txt index 89c4b5df1..137ae3666 100644 --- a/requirements-test.txt +++ b/requirements-test.txt @@ -3,6 +3,7 @@ coverage<5 codecov pytest-mypy pytest-cov +pytest-benchmark~=3.2.2 nbval pyarrow flake8 diff --git a/tests/benchmarks/bench.py b/tests/benchmarks/bench.py new file mode 100644 index 000000000..9cec766d1 --- /dev/null +++ b/tests/benchmarks/bench.py @@ -0,0 +1,74 @@ +import pandas as pd + +from pandas_profiling import ProfileReport +from pandas_profiling.utils.cache import cache_file + + +def test_titanic_explorative(benchmark): + file_name = cache_file( + "titanic.parquet", + "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/titanic.parquet", + ) + + 
data = pd.read_parquet(file_name) + + def func(df): + profile = ProfileReport( + df, title="Titanic Dataset", explorative=True, progress_bar=False + ) + report = profile.to_html() + return report + + benchmark(func, data) + + +def test_titanic_default(benchmark): + file_name = cache_file( + "titanic.parquet", + "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/titanic.parquet", + ) + + data = pd.read_parquet(file_name) + + def func(df): + profile = ProfileReport(df, title="Titanic Dataset", progress_bar=False) + report = profile.to_html() + return report + + benchmark(func, data) + + +def test_titanic_minimal(benchmark): + file_name = cache_file( + "titanic.parquet", + "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/titanic.parquet", + ) + + data = pd.read_parquet(file_name) + + def func(df): + profile = ProfileReport( + df, title="Titanic Dataset", minimal=True, progress_bar=False + ) + report = profile.to_html() + return report + + benchmark(func, data) + + +def test_rdw_minimal(benchmark): + file_name = cache_file( + "rdw.parquet", + "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/rdw.parquet", + ) + + data = pd.read_parquet(file_name) + + def func(df): + profile = ProfileReport( + df, title="RDW Dataset", minimal=True, progress_bar=False + ) + report = profile.to_html() + return report + + benchmark(func, data) diff --git a/tests/performance/time_inf.py b/tests/performance/time_inf.py deleted file mode 100644 index ba2aecaa4..000000000 --- a/tests/performance/time_inf.py +++ /dev/null @@ -1,25 +0,0 @@ -import timeit - -testcode = """ -import numpy as np -import pandas as pd - -np.random.seed(12) -vals = np.random.random(10000) -series = pd.Series(vals) -series[series < 0.3] = np.nan -series[series < 0.2] = np.Inf - - - -def f1(series): - return len(series.loc[(~np.isfinite(series)) & series.notnull()]) - - -def f2(series): - return ((series == np.inf) | (series == -np.inf)).sum() 
-""" - - -print(timeit.timeit("f1(series)", number=10, setup=testcode)) -print(timeit.timeit("f2(series)", number=10, setup=testcode)) diff --git a/tests/performance/time_kurtosis.py b/tests/performance/time_kurtosis.py deleted file mode 100644 index dfa106272..000000000 --- a/tests/performance/time_kurtosis.py +++ /dev/null @@ -1,36 +0,0 @@ -import timeit - -testcode = """ -import numpy as np -import pandas as pd -import scipy.stats - -np.random.seed(12) -vals = np.random.random(1000) -series = pd.Series(vals) -series[series < 0.2] = pd.NA - -def f1(series): - arr = series.values - return scipy.stats.kurtosis(arr, bias=False, nan_policy='omit') - - -def f2(series): - arr = series.values - arr_without_nan = arr[~np.isnan(arr)] - return scipy.stats.kurtosis(arr_without_nan, bias=False) - - -def f3(series): - return series.kurtosis() - - -def f4(series): - return series[series.notna()].kurtosis() -""" - - -print(timeit.timeit("f1(series)", number=10, setup=testcode)) -print(timeit.timeit("f2(series)", number=10, setup=testcode)) -print(timeit.timeit("f3(series)", number=10, setup=testcode)) -print(timeit.timeit("f4(series)", number=10, setup=testcode)) diff --git a/tests/performance/time_mad.py b/tests/performance/time_mad.py deleted file mode 100644 index 8c6107614..000000000 --- a/tests/performance/time_mad.py +++ /dev/null @@ -1,56 +0,0 @@ -import timeit - -testcode = ''' -import numpy as np -import pandas as pd - -np.random.seed(12) -vals = np.random.random(1000) -series = pd.Series(vals) -series[series < 0.2] = pd.NA - - -def mad(arr): - """ Median Absolute Deviation: a "Robust" version of standard deviation. - Indices variabililty of the sample. - https://en.wikipedia.org/wiki/Median_absolute_deviation - """ - arr = np.ma.array(arr).compressed() # should be faster to not use masked arrays. - med = np.median(arr) - return np.median(np.abs(arr - med)) - - -def mad2(arr): - """ Median Absolute Deviation: a "Robust" version of standard deviation. 
- Indices variabililty of the sample. - https://en.wikipedia.org/wiki/Median_absolute_deviation - """ - med = np.median(arr) - return np.median(np.abs(arr - med)) - - -def f1(series): - arr = series.values - arr_without_nan = arr[~np.isnan(arr)] - return mad(arr_without_nan) - - -def f2(series): - arr = series.values - arr_without_nan = arr[~np.isnan(arr)] - return mad(arr_without_nan) - - -def f3(series): - return series.mad() - - -def f4(series): - return series[series.notna()].mad() -''' - - -print(timeit.timeit("f1(series)", number=10, setup=testcode)) -print(timeit.timeit("f2(series)", number=10, setup=testcode)) -print(timeit.timeit("f3(series)", number=10, setup=testcode)) -print(timeit.timeit("f4(series)", number=10, setup=testcode)) diff --git a/tests/performance/time_mean.py b/tests/performance/time_mean.py deleted file mode 100644 index f6149a4c0..000000000 --- a/tests/performance/time_mean.py +++ /dev/null @@ -1,36 +0,0 @@ -import timeit - -testcode = """ -import numpy as np -import pandas as pd - -np.random.seed(12) -vals = np.random.random(1000) -series = pd.Series(vals) -series[series < 0.2] = pd.NA - - -def f1(series): - arr = series.values - arr_without_nan = arr[~np.isnan(arr)] - return np.mean(arr_without_nan) - - -def f2(series): - arr = series.values - return np.nanmean(arr) - - -def f3(series): - return series.mean() - - -def f4(series): - return series[series.notna()].mean() -""" - - -print(timeit.timeit("f1(series)", number=10, setup=testcode)) -print(timeit.timeit("f2(series)", number=10, setup=testcode)) -print(timeit.timeit("f3(series)", number=10, setup=testcode)) -print(timeit.timeit("f4(series)", number=10, setup=testcode)) diff --git a/tests/performance/timings.py b/tests/performance/timings.py deleted file mode 100644 index acde9360d..000000000 --- a/tests/performance/timings.py +++ /dev/null @@ -1,113 +0,0 @@ -import timeit -from itertools import product -from string import ascii_lowercase - -import numpy as np -import pandas as pd 
-import seaborn as sns -from matplotlib import pyplot as plt - -from pandas_profiling import ProfileReport - - -def generate_column_names(n): - column_names = [] - iters = 1 - while len(column_names) < n: - column_names += list( - "".join(combo) for combo in product(ascii_lowercase, repeat=iters) - ) - iters += 1 - return column_names - - -def make_sample_data(cols, rows): - column_names = generate_column_names(cols) - - df = pd.DataFrame( - np.random.randint(0, 1000000, size=(rows, cols)), columns=column_names[0:cols] - ) - df = df.astype(str) - - assert df.shape == (rows, cols) - return df.copy() - - -def make_report_minimal(df): - report = ProfileReport( - df, - minimal=True, - pool_size=0, - sort="None", - title="Dataset with Numeric Categories", - ) - html = report.to_html() - assert type(html) == str and '
<p class="h4">Dataset info</p>
' in html - - -def make_report(df): - report = ProfileReport( - df, - minimal=False, - pool_size=0, - sort="None", - title="Dataset with Numeric Categories", - ) - html = report.to_html() - assert type(html) == str and '
<p class="h4">Dataset info</p>
' in html - - -def wrap_func(function): - def inner(df): - def double_inner(): - return function(df) - - return double_inner - - return inner - - -def time_report(func, cols, rows, runs=5): - df = make_sample_data(cols, rows) - print(df.shape) - test = wrap_func(func)(df.copy()) - return timeit.timeit(test, number=runs) / runs - - -def plot_col_run_time(): - cols = [2, 4, 10, 50] - row = 1000 - default_times = [time_report(make_report, col, row) for col in cols] - minimal_times = [time_report(make_report_minimal, col, row) for col in cols] - - ax1 = sns.scatterplot(cols, default_times) - ax2 = sns.scatterplot(cols, minimal_times) - _ = ax1.set( - xlabel=f"Number of columns (row={row})", - ylabel="time (s)", - title="Run Time Complexity", - ) - plt.show() - - -def plot_row_run_time(): - # 10, 100 - # https://github.com/pandas-profiling/pandas-profiling/issues/270 - rows = [1000, 10000, 100000] - col = 10 - default_times = [time_report(make_report, col, row) for row in rows] - minimal_times = [time_report(make_report_minimal, col, row) for row in rows] - - ax1 = sns.scatterplot(rows, default_times) - ax2 = sns.scatterplot(rows, minimal_times) - _ = ax1.set( - xlabel=f"Number of rows (col={col})", - ylabel="time (s)", - title="Run Time Complexity", - ) - plt.show() - - -if __name__ == "__main__": - plot_col_run_time() - plot_row_run_time() From e6844e58da9660b1a789c648f3c7e203fd0380ab Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 18:45:37 +0200 Subject: [PATCH 22/40] ci(benchmark): update installation --- .github/workflows/benchmark.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml index 24c1bf373..f9695122a 100644 --- a/.github/workflows/benchmark.yml +++ b/.github/workflows/benchmark.yml @@ -24,9 +24,11 @@ jobs: python-version: ${{ matrix.python }} - name: Run benchmark run: | + pip install --upgrade pip setuptools wheel pip install -r requirements.txt pip 
install -r requirements-test.txt - pytest tests/benchmarks/bench.py --benchmark-json benchmark.json + - run: make install + - run: pytest tests/benchmarks/bench.py --benchmark-json benchmark.json - name: Store benchmark result uses: rhysd/github-action-benchmark@v1 with: From 011e3f77a4a22882b4f8ccd1b7e0c505142009c8 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 18:54:37 +0200 Subject: [PATCH 23/40] ci(benchmark): github actions SIGKILL due to memory usage of benchmarks --- tests/benchmarks/bench.py | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/tests/benchmarks/bench.py b/tests/benchmarks/bench.py index 9cec766d1..245b23c0f 100644 --- a/tests/benchmarks/bench.py +++ b/tests/benchmarks/bench.py @@ -56,19 +56,19 @@ def func(df): benchmark(func, data) -def test_rdw_minimal(benchmark): - file_name = cache_file( - "rdw.parquet", - "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/rdw.parquet", - ) - - data = pd.read_parquet(file_name) - - def func(df): - profile = ProfileReport( - df, title="RDW Dataset", minimal=True, progress_bar=False - ) - report = profile.to_html() - return report - - benchmark(func, data) +# def test_rdw_minimal(benchmark): +# file_name = cache_file( +# "rdw.parquet", +# "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/rdw.parquet", +# ) +# +# data = pd.read_parquet(file_name) +# +# def func(df): +# profile = ProfileReport( +# df, title="RDW Dataset", minimal=True, progress_bar=False +# ) +# report = profile.to_html() +# return report +# +# benchmark(func, data) From bbea211ee34763b5a713917fc5a99e8d840f48bb Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Wed, 7 Apr 2021 19:09:24 +0200 Subject: [PATCH 24/40] Split tests and coverage in CI/CD (#754) * Split tests and coverage in CI/CD Co-authored-by: chanedwin --- .github/workflows/{ci.yml => release.yml} | 2 +- .github/workflows/{ci_test.yml => tests.yml} | 52 
++++++++++++++++++-- Makefile | 4 +- 3 files changed, 53 insertions(+), 5 deletions(-) rename .github/workflows/{ci.yml => release.yml} (99%) rename .github/workflows/{ci_test.yml => tests.yml} (56%) diff --git a/.github/workflows/ci.yml b/.github/workflows/release.yml similarity index 99% rename from .github/workflows/ci.yml rename to .github/workflows/release.yml index 48e1aa31b..e286aa439 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/release.yml @@ -1,4 +1,4 @@ -name: CI +name: Release CI on: push: diff --git a/.github/workflows/ci_test.yml b/.github/workflows/tests.yml similarity index 56% rename from .github/workflows/ci_test.yml rename to .github/workflows/tests.yml index cd797af42..95fbbbc21 100644 --- a/.github/workflows/ci_test.yml +++ b/.github/workflows/tests.yml @@ -1,9 +1,9 @@ -name: Tests and Coverage +name: CI on: push jobs: - build: + test: runs-on: ${{ matrix.os }} strategy: matrix: @@ -33,7 +33,53 @@ jobs: pandas: "pandas>1.1" numpy: "numpy" - name: python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }} + name: Tests | python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }} + steps: + - uses: actions/checkout@v2 + - name: Setup python + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + architecture: x64 + - uses: actions/cache@v2 + if: startsWith(runner.os, 'Linux') + with: + path: ~/.cache/pip + key: ${{ runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }} + restore-keys: | + ${{ runner.os }}-${{ matrix.pandas }}-pip- + + - uses: actions/cache@v2 + if: startsWith(runner.os, 'macOS') + with: + path: ~/Library/Caches/pip + key: ${{ runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }} + restore-keys: | + ${{ runner.os }}-${{ matrix.pandas }}-pip- + + - uses: actions/cache@v2 + if: startsWith(runner.os, 'Windows') + with: + path: ~\AppData\Local\pip\Cache + key: ${{ 
runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }} + restore-keys: | + ${{ runner.os }}-${{ matrix.pandas }}-pip- + - run: | + pip install --upgrade pip setuptools wheel + pip install -r requirements.txt "${{ matrix.pandas }}" "${{ matrix.numpy }}" + pip install -r requirements-test.txt + - run: make install + - run: make test + coverage: + runs-on: ${{ matrix.os }} + strategy: + matrix: + os: [ ubuntu-latest ] + python-version: [ 3.8 ] + pandas: [ "pandas>1.1"] + numpy: ["numpy"] + + name: Coverage | python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }} steps: - uses: actions/checkout@v2 - name: Setup python diff --git a/Makefile b/Makefile index bbcd539e9..3a8e2d836 100644 --- a/Makefile +++ b/Makefile @@ -16,7 +16,9 @@ test: pytest tests/issues/ pytest --nbval tests/notebooks/ flake8 . --select=E9,F63,F7,F82 --show-source --statistics - + pandas_profiling -h + make typing + test_cov: pytest --cov=. tests/unit/ pytest --cov=. 
--cov-append tests/issues/ From ed85c200692eb197e03a8514ee4824553cf7508c Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 19:39:04 +0200 Subject: [PATCH 25/40] ci(benchmark): alert on performance regression --- .github/workflows/benchmark.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml index f9695122a..94ae74ae5 100644 --- a/.github/workflows/benchmark.yml +++ b/.github/workflows/benchmark.yml @@ -36,4 +36,7 @@ jobs: tool: 'pytest' output-file-path: benchmark.json github-token: ${{ secrets.GITHUB_TOKEN }} - auto-push: true \ No newline at end of file + auto-push: true + + comment-on-alert: true + alert-comment-cc-users: '@sbrugman' From 6f90be0cd74708d490ce27b019c5dd74ba0f80c0 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 19:45:47 +0200 Subject: [PATCH 26/40] ci(benchmark): add RDW 100k sample --- tests/benchmarks/bench.py | 65 +++++++++++++++------------------------ 1 file changed, 25 insertions(+), 40 deletions(-) diff --git a/tests/benchmarks/bench.py b/tests/benchmarks/bench.py index 245b23c0f..a1db2a0b0 100644 --- a/tests/benchmarks/bench.py +++ b/tests/benchmarks/bench.py @@ -1,9 +1,17 @@ +from functools import partial + import pandas as pd from pandas_profiling import ProfileReport from pandas_profiling.utils.cache import cache_file +def func(df, **kwargs): + profile = ProfileReport(df, progress_bar=False, **kwargs) + report = profile.to_html() + return report + + def test_titanic_explorative(benchmark): file_name = cache_file( "titanic.parquet", @@ -12,14 +20,8 @@ def test_titanic_explorative(benchmark): data = pd.read_parquet(file_name) - def func(df): - profile = ProfileReport( - df, title="Titanic Dataset", explorative=True, progress_bar=False - ) - report = profile.to_html() - return report - - benchmark(func, data) + kwargs = dict(explorative=True) + benchmark(partial(func, **kwargs), data) def test_titanic_default(benchmark): @@ 
-30,12 +32,7 @@ def test_titanic_default(benchmark): data = pd.read_parquet(file_name) - def func(df): - profile = ProfileReport(df, title="Titanic Dataset", progress_bar=False) - report = profile.to_html() - return report - - benchmark(func, data) + benchmark(partial(func), data) def test_titanic_minimal(benchmark): @@ -46,29 +43,17 @@ def test_titanic_minimal(benchmark): data = pd.read_parquet(file_name) - def func(df): - profile = ProfileReport( - df, title="Titanic Dataset", minimal=True, progress_bar=False - ) - report = profile.to_html() - return report - - benchmark(func, data) - - -# def test_rdw_minimal(benchmark): -# file_name = cache_file( -# "rdw.parquet", -# "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/rdw.parquet", -# ) -# -# data = pd.read_parquet(file_name) -# -# def func(df): -# profile = ProfileReport( -# df, title="RDW Dataset", minimal=True, progress_bar=False -# ) -# report = profile.to_html() -# return report -# -# benchmark(func, data) + kwargs = dict(minimal=True) + benchmark(partial(func, **kwargs), data) + + +def test_rdw_minimal(benchmark): + file_name = cache_file( + "rdw_sample_100k.parquet", + "https://github.com/pandas-profiling/pandas-profiling-data/raw/master/data/rdw_sample_100k.parquet", + ) + + data = pd.read_parquet(file_name) + + kwargs = dict(minimal=True) + benchmark(partial(func, **kwargs), data) From c7111d7b543e07807061758664dd07afc05b1a69 Mon Sep 17 00:00:00 2001 From: Jan Kadlec <54404810+jankaWIS@users.noreply.github.com> Date: Wed, 7 Apr 2021 21:26:24 +0300 Subject: [PATCH 27/40] docs(config): update docs - customise plots in report (#742) * update docs - customise plots in report --- docsrc/source/pages/advanced_usage.rst | 72 ++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/docsrc/source/pages/advanced_usage.rst b/docsrc/source/pages/advanced_usage.rst index 8f7099ef6..ba39f6bbb 100644 --- a/docsrc/source/pages/advanced_usage.rst +++ 
b/docsrc/source/pages/advanced_usage.rst @@ -165,3 +165,75 @@ It's possible to disable certain groups of features through configuration shorth r.set_variable("correlations", None) r.set_variable("missing_diagrams", None) r.set_variable("interactions", None) + + + + +Customise plots +--------------- + +Arguments can be passed to the underlying matplotlib via the ``plot`` argument. For example, it is possible to change the image format from the default (svg) to png using the key-value pair ``image_format: "png"``, and to change the resolution of the image using ``dpi: 800``. + +An example would be: + +.. code-block:: python + + profile = ProfileReport(planets, title='Pandas Profiling Report', explorative=True, + plot={ + 'dpi':200, + 'image_format': 'png' + }) + + +Furthermore, it is possible to change the default histogram settings; the available options are the following: + + histogram: + x_axis_labels: True + + # Number of bins (set to 0 to automatically detect the bin size) + bins: 50 + + # Maximum number of bins (when bins=0) + max_bins: 250 + + + + + +Customise correlation matrix +----------------------------- + +It's possible to directly customise the correlation matrix plot as well. That is done with the ``plot`` argument and then with the ``correlation`` key. It is possible to customise the palette: one can use any of the following palettes (as used in seaborn) or create `their own custom matplotlib palette <https://matplotlib.org/stable/gallery/color/custom_cmap.html>`_. 
Supported values are:: + + +    'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'turbo', 'turbo_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r' + + +An example can be: + +.. 
code-block:: python + + from pandas_profiling import ProfileReport + + profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True, + plot={ + 'correlation':{ + 'cmap': 'RdBu_r', + 'bad': '#000000'}} + ) + + +Similarly, one can change the palette for *Missing values* using the ``missing`` argument, e.g.: + +.. code-block:: python + + from pandas_profiling import ProfileReport + + profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True, + plot={ + 'missing':{ + 'cmap': 'RdBu_r'}} + ) + + + From 6d2a418eba03eebfb1383f476dcd33860d124914 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 21:17:49 +0200 Subject: [PATCH 28/40] docs: benchmarks --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 5dda09329..c35f10270 100644 --- a/README.md +++ b/README.md @@ -220,6 +220,8 @@ profile = ProfileReport(large_dataset, minimal=True) profile.to_file("output.html") ``` +Benchmarks are available [here](https://pandas-profiling.github.io/pandas-profiling/dev/bench/). + ### Command line usage For standard formatted CSV files that can be read immediately by pandas, you can use the `pandas_profiling` executable. 
From ab5cd93b8fd243c4dfb77c5851102b3e5e83f911 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Fri, 2 Apr 2021 15:03:24 +0200 Subject: [PATCH 29/40] refactor: Monotonicity formatter --- .../model/summary_algorithms.py | 10 +++++++ src/pandas_profiling/report/formatters.py | 16 +++++++++++ .../report/structure/variables/render_real.py | 17 ++++-------- tests/unit/test_formatters.py | 27 +++++++++++++++++++ 4 files changed, 58 insertions(+), 12 deletions(-) diff --git a/src/pandas_profiling/model/summary_algorithms.py b/src/pandas_profiling/model/summary_algorithms.py index 768829736..201a59379 100644 --- a/src/pandas_profiling/model/summary_algorithms.py +++ b/src/pandas_profiling/model/summary_algorithms.py @@ -233,6 +233,16 @@ def describe_numeric_1d(series: pd.Series, summary: dict) -> Tuple[pd.Series, di stats["monotonic_decrease_strict"] = ( stats["monotonic_decrease"] and series.is_unique ) + if summary["monotonic_increase_strict"]: + stats["monotonic"] = 2 + elif summary["monotonic_decrease_strict"]: + stats["monotonic"] = -2 + elif summary["monotonic_increase"]: + stats["monotonic"] = 1 + elif summary["monotonic_decrease"]: + stats["monotonic"] = -1 + else: + stats["monotonic"] = 0 stats.update( histogram_compute( diff --git a/src/pandas_profiling/report/formatters.py b/src/pandas_profiling/report/formatters.py index a4fe77ae0..67091d447 100644 --- a/src/pandas_profiling/report/formatters.py +++ b/src/pandas_profiling/report/formatters.py @@ -257,6 +257,21 @@ def fmt(value) -> str: return str(escape(value)) +def fmt_monotonic(value: int) -> str: + if value == 2: + return "Strictly increasing" + elif value == 1: + return "Increasing" + elif value == 0: + return "Not monotonic" + elif value == -1: + return "Decreasing" + elif value == -2: + return "Strictly decreasing" + else: + raise ValueError("Value should be integer ranging from -2 to 2.") + + def help(title, url=None) -> str: """Creat help badge @@ -283,6 +298,7 @@ def get_fmt_mapping() -> Dict[str, 
Callable]: "fmt_bytesize": fmt_bytesize, "fmt_timespan": fmt_timespan, "fmt_numeric": fmt_numeric, + "fmt_monotonic": fmt_monotonic, "fmt_number": fmt_number, "fmt_array": fmt_array, "fmt": fmt, diff --git a/src/pandas_profiling/report/structure/variables/render_real.py b/src/pandas_profiling/report/structure/variables/render_real.py index e7ce82412..6624548d5 100644 --- a/src/pandas_profiling/report/structure/variables/render_real.py +++ b/src/pandas_profiling/report/structure/variables/render_real.py @@ -152,17 +152,6 @@ def render_real(summary): name="Quantile statistics", ) - if summary["monotonic_increase_strict"]: - monotocity = "Strictly increasing" - elif summary["monotonic_decrease_strict"]: - monotocity = "Strictly decreasing" - elif summary["monotonic_increase"]: - monotocity = "Increasing" - elif summary["monotonic_decrease"]: - monotocity = "Decreasing" - else: - monotocity = "Not monotonic" - descriptive_statistics = Table( [ { @@ -190,7 +179,11 @@ def render_real(summary): }, {"name": "Sum", "value": summary["sum"], "fmt": "fmt_numeric"}, {"name": "Variance", "value": summary["variance"], "fmt": "fmt_numeric"}, - {"name": "Monotocity", "value": monotocity, "fmt": "fmt"}, + { + "name": "Monotonicity", + "value": summary["monotonic"], + "fmt": "fmt_monotonic", + }, ], name="Descriptive statistics", ) diff --git a/tests/unit/test_formatters.py b/tests/unit/test_formatters.py index 2e103e53d..4f6f46faf 100644 --- a/tests/unit/test_formatters.py +++ b/tests/unit/test_formatters.py @@ -6,6 +6,7 @@ fmt_bytesize, fmt_class, fmt_color, + fmt_monotonic, fmt_numeric, ) @@ -86,3 +87,29 @@ def test_fmt_array(array, threshold, expected): ) def test_fmt_numeric(value, precision, expected): assert fmt_numeric(value, precision) == expected + + +@pytest.mark.parametrize( + "value, expected", + [ + (-2, "Strictly decreasing"), + (-1, "Decreasing"), + (0, "Not monotonic"), + (1, "Increasing"), + (2, "Strictly increasing"), + ], +) +def test_fmt_monotonic(value, 
expected): + assert fmt_monotonic(value) == expected + + +@pytest.mark.parametrize( + "value", + [ + -3, + 3, + ], +) +def test_fmt_monotonic_err(value): + with pytest.raises(ValueError): + fmt_monotonic(value) From b42b22ad827e9e0caba4f84315e96da9a87b52e0 Mon Sep 17 00:00:00 2001 From: sbrugman Date: Wed, 7 Apr 2021 21:36:15 +0200 Subject: [PATCH 30/40] ci: benchmark increase min rounds to 10 --- .github/workflows/benchmark.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml index 94ae74ae5..d92839944 100644 --- a/.github/workflows/benchmark.yml +++ b/.github/workflows/benchmark.yml @@ -28,7 +28,7 @@ jobs: pip install -r requirements.txt pip install -r requirements-test.txt - run: make install - - run: pytest tests/benchmarks/bench.py --benchmark-json benchmark.json + - run: pytest tests/benchmarks/bench.py --benchmark-min-rounds 10 --benchmark-warmup "on" --benchmark-json benchmark.json - name: Store benchmark result uses: rhysd/github-action-benchmark@v1 with: From 0a2d1dbf866b3491105e36d798ec8b089c89895e Mon Sep 17 00:00:00 2001 From: Simon Brugman Date: Wed, 7 Apr 2021 23:39:29 +0200 Subject: [PATCH 31/40] perf: performance improvements (#755) * perf: do not infer dtypes in minimal mode * perf: reuse duplicate row statistics and exclude in minimal mode * perf: take top-n values in categorical histograms * perf: reuse sorted values for frequency tables * fix: unused imports --- README.md | 2 +- src/pandas_profiling/config_default.yaml | 1 + src/pandas_profiling/config_minimal.yaml | 3 +- src/pandas_profiling/model/describe.py | 3 +- src/pandas_profiling/model/duplicates.py | 49 ++++++++---- src/pandas_profiling/model/messages.py | 2 +- src/pandas_profiling/model/summary.py | 40 +--------- .../model/summary_algorithms.py | 8 +- src/pandas_profiling/report/formatters.py | 2 +- .../presentation/frequency_table_utils.py | 52 ++++++------- .../report/structure/overview.py | 76 
+++++++++++-------- .../structure/variables/render_common.py | 8 +- tests/issues/test_issue51.py | 3 - tests/unit/test_custom_sample.py | 2 - .../test_duplicates.py} | 11 ++- tests/unit/test_interactions.py | 2 - tests/unit/test_summary.py | 2 - 17 files changed, 128 insertions(+), 138 deletions(-) rename tests/{issues/test_issue725.py => unit/test_duplicates.py} (67%) diff --git a/README.md b/README.md index c35f10270..dfc8578ee 100644 --- a/README.md +++ b/README.md @@ -211,7 +211,7 @@ profile.to_file("your_report.json") Version 2.4 introduces minimal mode. -This is a default configuration that disables expensive computations (such as correlations and dynamic binning). +This is a default configuration that disables expensive computations (such as correlations and duplicate row detection). Use the following syntax: diff --git a/src/pandas_profiling/config_default.yaml b/src/pandas_profiling/config_default.yaml index ad40d6039..fa7009003 100644 --- a/src/pandas_profiling/config_default.yaml +++ b/src/pandas_profiling/config_default.yaml @@ -48,6 +48,7 @@ vars: chi_squared_threshold: 0.999 coerce_str_to_date: False redact: False + histogram_largest: 50 bool: n_obs: 3 # string to boolean mappings pairs (true, false) diff --git a/src/pandas_profiling/config_minimal.yaml b/src/pandas_profiling/config_minimal.yaml index 1147fd680..16076c90f 100644 --- a/src/pandas_profiling/config_minimal.yaml +++ b/src/pandas_profiling/config_minimal.yaml @@ -14,7 +14,7 @@ variables: descriptions: {} # infer dtypes -infer_dtypes: True +infer_dtypes: False # Show the description at each variable (in addition to the overview tab) show_variable_description: True @@ -48,6 +48,7 @@ vars: chi_squared_threshold: 0.0 coerce_str_to_date: False redact: False + histogram_largest: 10 bool: n_obs: 3 # string to boolean mappings pairs (true, false) diff --git a/src/pandas_profiling/model/describe.py b/src/pandas_profiling/model/describe.py index a9deb0d48..22d8b1f39 100644 --- 
a/src/pandas_profiling/model/describe.py +++ b/src/pandas_profiling/model/describe.py @@ -131,7 +131,8 @@ def describe( # Duplicates pbar.set_postfix_str("Locating duplicates") - duplicates = get_duplicates(df, supported_columns) + metrics, duplicates = get_duplicates(df, supported_columns) + table_stats.update(metrics) pbar.update() # Messages diff --git a/src/pandas_profiling/model/duplicates.py b/src/pandas_profiling/model/duplicates.py index 72ff83a94..90b1ad0fe 100644 --- a/src/pandas_profiling/model/duplicates.py +++ b/src/pandas_profiling/model/duplicates.py @@ -1,11 +1,13 @@ -from typing import Optional +from typing import Any, Dict, Optional, Tuple import pandas as pd from pandas_profiling.config import config -def get_duplicates(df: pd.DataFrame, supported_columns) -> Optional[pd.DataFrame]: +def get_duplicates( + df: pd.DataFrame, supported_columns +) -> Tuple[Dict[str, Any], Optional[pd.DataFrame]]: """Obtain the most occurring duplicate rows in the DataFrame. Args: @@ -17,19 +19,34 @@ def get_duplicates(df: pd.DataFrame, supported_columns) -> Optional[pd.DataFrame """ n_head = config["duplicates"]["head"].get(int) - if n_head > 0 and supported_columns: - duplicates_key = config["duplicates"]["key"].get(str) - if duplicates_key in df.columns: - raise ValueError( - f"Duplicates key ({duplicates_key}) may not be part of the DataFrame. Either change the " - f" column name in the DataFrame or change the 'duplicates.key' parameter." + metrics: Dict[str, Any] = {} + if n_head > 0: + if supported_columns and len(df) > 0: + duplicates_key = config["duplicates"]["key"].get(str) + if duplicates_key in df.columns: + raise ValueError( + f"Duplicates key ({duplicates_key}) may not be part of the DataFrame. Either change the " + f" column name in the DataFrame or change the 'duplicates.key' parameter." 
+ ) + + duplicated_rows = df.duplicated(subset=supported_columns, keep=False) + duplicated_rows = ( + df[duplicated_rows] + .groupby(supported_columns) + .size() + .reset_index(name=duplicates_key) ) - return ( - df[df.duplicated(subset=supported_columns, keep=False)] - .groupby(supported_columns) - .size() - .reset_index(name=duplicates_key) - .nlargest(n_head, duplicates_key) - ) - return None + metrics["n_duplicates"] = len(duplicated_rows[duplicates_key]) + metrics["p_duplicates"] = metrics["n_duplicates"] / len(df) + + return ( + metrics, + duplicated_rows.nlargest(n_head, duplicates_key), + ) + else: + metrics["n_duplicates"] = 0 + metrics["p_duplicates"] = 0.0 + return metrics, None + else: + return metrics, None diff --git a/src/pandas_profiling/model/messages.py b/src/pandas_profiling/model/messages.py index 7f41dc865..3e557cc63 100644 --- a/src/pandas_profiling/model/messages.py +++ b/src/pandas_profiling/model/messages.py @@ -112,7 +112,7 @@ def check_table_messages(table: dict) -> List[Message]: A list of messages. 
""" messages = [] - if warning_value(table["n_duplicates"]): + if "n_duplicates" in table and warning_value(table["n_duplicates"]): messages.append( Message( message_type=MessageType.DUPLICATES, diff --git a/src/pandas_profiling/model/summary.py b/src/pandas_profiling/model/summary.py index b5c2191a2..ea14eae7e 100644 --- a/src/pandas_profiling/model/summary.py +++ b/src/pandas_profiling/model/summary.py @@ -4,7 +4,7 @@ import multiprocessing.pool import warnings from collections import Counter -from typing import Callable, Mapping, Optional, Tuple +from typing import Callable, Mapping, Tuple import numpy as np import pandas as pd @@ -16,7 +16,6 @@ check_variable_messages, ) from pandas_profiling.model.summarizer import BaseSummarizer -from pandas_profiling.model.typeset import Unsupported from pandas_profiling.visualisation.missing import ( missing_bar, missing_dendrogram, @@ -149,20 +148,6 @@ def get_table_stats(df: pd.DataFrame, variable_stats: dict) -> dict: else 0 ) - supported_columns = [ - k for k, v in variable_stats.items() if v["type"] != Unsupported - ] - table_stats["n_duplicates"] = ( - sum(df.duplicated(subset=supported_columns)) - if len(supported_columns) > 0 - else 0 - ) - table_stats["p_duplicates"] = ( - (table_stats["n_duplicates"] / len(df)) - if (len(supported_columns) > 0 and len(df) > 0) - else 0 - ) - # Variable type counts table_stats.update( {"types": dict(Counter([v["type"] for v in variable_stats.values()]))} @@ -171,29 +156,6 @@ def get_table_stats(df: pd.DataFrame, variable_stats: dict) -> dict: return table_stats -def get_duplicates(df: pd.DataFrame, supported_columns) -> Optional[pd.DataFrame]: - """Obtain the most occurring duplicate rows in the DataFrame. - - Args: - df: the Pandas DataFrame. - supported_columns: the columns to consider - - Returns: - A subset of the DataFrame, ordered by occurrence. 
- """ - n_head = config["duplicates"]["head"].get(int) - - if n_head > 0 and supported_columns: - return ( - df[df.duplicated(subset=supported_columns, keep=False)] - .groupby(supported_columns) - .size() - .reset_index(name="count") - .nlargest(n_head, "count") - ) - return None - - def get_missing_diagrams(df: pd.DataFrame, table_stats: dict) -> dict: """Gets the rendered diagrams for missing values. diff --git a/src/pandas_profiling/model/summary_algorithms.py b/src/pandas_profiling/model/summary_algorithms.py index 201a59379..9a95f6be3 100644 --- a/src/pandas_profiling/model/summary_algorithms.py +++ b/src/pandas_profiling/model/summary_algorithms.py @@ -305,10 +305,16 @@ def describe_categorical_1d(series: pd.Series, summary: dict) -> Tuple[pd.Series # Only run if at least 1 non-missing value value_counts = summary["value_counts_without_nan"] + histogram_largest = config["vars"]["cat"]["histogram_largest"].get(int) + histogram_data = value_counts + if histogram_largest > 0: + histogram_data = histogram_data.nlargest(histogram_largest) summary.update( histogram_compute( - value_counts, summary["n_distinct"], name="histogram_frequencies" + histogram_data, + summary["n_distinct"], + name="histogram_frequencies", ) ) diff --git a/src/pandas_profiling/report/formatters.py b/src/pandas_profiling/report/formatters.py index 67091d447..558f4fa4e 100644 --- a/src/pandas_profiling/report/formatters.py +++ b/src/pandas_profiling/report/formatters.py @@ -78,7 +78,7 @@ def fmt_timespan(num_seconds, detailed=False, max_units=3): import math import numbers import re - from datetime import datetime, timedelta + from datetime import timedelta time_units = ( dict( diff --git a/src/pandas_profiling/report/presentation/frequency_table_utils.py b/src/pandas_profiling/report/presentation/frequency_table_utils.py index bb53e1dae..0862a19b8 100644 --- a/src/pandas_profiling/report/presentation/frequency_table_utils.py +++ 
b/src/pandas_profiling/report/presentation/frequency_table_utils.py @@ -1,7 +1,9 @@ -from typing import Dict, Sequence +from typing import Any, Dict, List +import numpy as np -def freq_table(freqtable, n: int, max_number_to_print: int) -> Sequence[Dict]: + +def freq_table(freqtable, n: int, max_number_to_print: int) -> List[Dict]: """Render the rows for a frequency table (value, count). Args: @@ -19,13 +21,13 @@ def freq_table(freqtable, n: int, max_number_to_print: int) -> Sequence[Dict]: max_number_to_print = n if max_number_to_print < len(freqtable): - freq_other = sum(freqtable.iloc[max_number_to_print:]) + freq_other = np.sum(freqtable.iloc[max_number_to_print:]) min_freq = freqtable.values[max_number_to_print] else: freq_other = 0 min_freq = 0 - freq_missing = n - sum(freqtable) + freq_missing = n - np.sum(freqtable) # No values if len(freqtable) == 0: return [] @@ -79,39 +81,37 @@ def freq_table(freqtable, n: int, max_number_to_print: int) -> Sequence[Dict]: return rows -def extreme_obs_table(freqtable, number_to_print, n, ascending=True) -> list: +def extreme_obs_table(freqtable, number_to_print: int, n: int) -> List[Dict[str, Any]]: """Similar to the frequency table, for extreme observations. Args: - freqtable: The frequency table. + freqtable: The (sorted) frequency table. number_to_print: The number of observations to print. n: The total number of observations. - ascending: The ordering of the observations (Default value = True) Returns: The HTML rendering of the extreme observation table. """ + # If it's mixed between base types (str, int) convert to str. Pure "mixed" types are filtered during type # discovery # TODO: should be in cast? 
- if "mixed" in freqtable.index.inferred_type: - freqtable.index = freqtable.index.astype(str) - - sorted_freqtable = freqtable.sort_index(ascending=ascending) - obs_to_print = sorted_freqtable.iloc[:number_to_print] - max_freq = max(obs_to_print.values) - - rows = [] - for label, freq in obs_to_print.items(): - rows.append( - { - "label": label, - "width": freq / max_freq if max_freq != 0 else 0, - "count": freq, - "percentage": float(freq) / n, - "extra_class": "", - "n": n, - } - ) + # if "mixed" in freqtable.index.inferred_type: + # freqtable.index = freqtable.index.astype(str) + + obs_to_print = freqtable.iloc[:number_to_print] + max_freq = obs_to_print.max() + + rows = [ + { + "label": label, + "width": freq / max_freq if max_freq != 0 else 0, + "count": freq, + "percentage": float(freq) / n, + "extra_class": "", + "n": n, + } + for label, freq in obs_to_print.items() + ] return rows diff --git a/src/pandas_profiling/report/structure/overview.py b/src/pandas_profiling/report/structure/overview.py index e8751086a..2ab3bcb60 100644 --- a/src/pandas_profiling/report/structure/overview.py +++ b/src/pandas_profiling/report/structure/overview.py @@ -7,38 +7,46 @@ def get_dataset_overview(summary): - dataset_info = Table( + table_metrics = [ + { + "name": "Number of variables", + "value": summary["table"]["n_var"], + "fmt": "fmt_number", + }, + { + "name": "Number of observations", + "value": summary["table"]["n"], + "fmt": "fmt_number", + }, + { + "name": "Missing cells", + "value": summary["table"]["n_cells_missing"], + "fmt": "fmt_number", + }, + { + "name": "Missing cells (%)", + "value": summary["table"]["p_cells_missing"], + "fmt": "fmt_percent", + }, + ] + if "n_duplicates" in summary["table"]: + table_metrics.extend( + [ + { + "name": "Duplicate rows", + "value": summary["table"]["n_duplicates"], + "fmt": "fmt_number", + }, + { + "name": "Duplicate rows (%)", + "value": summary["table"]["p_duplicates"], + "fmt": "fmt_percent", + }, + ] + ) + + 
        table_metrics.extend(
            [
-                {
-                    "name": "Number of variables",
-                    "value": summary["table"]["n_var"],
-                    "fmt": "fmt_number",
-                },
-                {
-                    "name": "Number of observations",
-                    "value": summary["table"]["n"],
-                    "fmt": "fmt_number",
-                },
-                {
-                    "name": "Missing cells",
-                    "value": summary["table"]["n_cells_missing"],
-                    "fmt": "fmt_number",
-                },
-                {
-                    "name": "Missing cells (%)",
-                    "value": summary["table"]["p_cells_missing"],
-                    "fmt": "fmt_percent",
-                },
-                {
-                    "name": "Duplicate rows",
-                    "value": summary["table"]["n_duplicates"],
-                    "fmt": "fmt_number",
-                },
-                {
-                    "name": "Duplicate rows (%)",
-                    "value": summary["table"]["p_duplicates"],
-                    "fmt": "fmt_percent",
-                },
                {
                    "name": "Total size in memory",
                    "value": summary["table"]["memory_size"],
@@ -49,7 +57,11 @@ def get_dataset_overview(summary):
                    "value": summary["table"]["record_size"],
                    "fmt": "fmt_bytesize",
                },
-        ],
+        ]
+    )
+
+    dataset_info = Table(
+        table_metrics,
        name="Dataset statistics",
    )

diff --git a/src/pandas_profiling/report/structure/variables/render_common.py b/src/pandas_profiling/report/structure/variables/render_common.py
index 426f258b1..e55d29536 100644
--- a/src/pandas_profiling/report/structure/variables/render_common.py
+++ b/src/pandas_profiling/report/structure/variables/render_common.py
@@ -9,6 +9,8 @@ def render_common(summary):
    n_extreme_obs = config["n_extreme_obs"].get(int)
    n_freq_table_max = config["n_freq_table_max"].get(int)

+    sorted_freqtable = summary["value_counts_without_nan"].sort_index(ascending=True)
+
    template_variables = {
        # TODO: with nan
        "freq_table_rows": freq_table(
@@ -17,16 +19,14 @@ def render_common(summary):
            max_number_to_print=n_freq_table_max,
        ),
        "firstn_expanded": extreme_obs_table(
-            freqtable=summary["value_counts_without_nan"],
+            freqtable=sorted_freqtable,
            number_to_print=n_extreme_obs,
            n=summary["n"],
-            ascending=True,
        ),
        "lastn_expanded": extreme_obs_table(
-            freqtable=summary["value_counts_without_nan"],
+            freqtable=sorted_freqtable[::-1],
            number_to_print=n_extreme_obs,
            n=summary["n"],
-            ascending=False,
        ),
    }

diff --git a/tests/issues/test_issue51.py b/tests/issues/test_issue51.py
index 50617ca81..71815f23e 100644
--- a/tests/issues/test_issue51.py
+++ b/tests/issues/test_issue51.py
@@ -7,9 +7,6 @@

 import pandas_profiling

-# FIXME: correlations can be computed stand alone to speed up tests
-from pandas_profiling.config import config
-

 def test_issue51(get_data_file):
    # Categorical has empty ('') value

diff --git a/tests/unit/test_custom_sample.py b/tests/unit/test_custom_sample.py
index 81a0bd551..4aab90280 100644
--- a/tests/unit/test_custom_sample.py
+++ b/tests/unit/test_custom_sample.py
@@ -1,5 +1,3 @@
-from pathlib import Path
-
 import pandas as pd

 from pandas_profiling import ProfileReport

diff --git a/tests/issues/test_issue725.py b/tests/unit/test_duplicates.py
similarity index 67%
rename from tests/issues/test_issue725.py
rename to tests/unit/test_duplicates.py
index c46e37d67..b3043edce 100644
--- a/tests/issues/test_issue725.py
+++ b/tests/unit/test_duplicates.py
@@ -1,7 +1,4 @@
-"""
-Test for issue 725:
-https://github.com/pandas-profiling/pandas-profiling/issues/725
-"""
+"""Test for the duplicates functionality"""
 import numpy as np
 import pandas as pd
 import pytest
@@ -21,11 +18,13 @@ def test_data():


 def test_issue725(test_data):
-    duplicates = get_duplicates(test_data, list(test_data.columns))
+    metrics, duplicates = get_duplicates(test_data, list(test_data.columns))
+    assert metrics["n_duplicates"] == 100
+    assert metrics["p_duplicates"] == 0.5
    assert set(duplicates.columns) == set(test_data.columns).union({"# duplicates"})


 def test_issue725_existing(test_data):
    test_data = test_data.rename(columns={"count": "# duplicates"})
    with pytest.raises(ValueError):
-        _ = get_duplicates(test_data, list(test_data.columns))
+        _, _ = get_duplicates(test_data, list(test_data.columns))

diff --git a/tests/unit/test_interactions.py b/tests/unit/test_interactions.py
index ac658bb99..e02dac497 100644
--- a/tests/unit/test_interactions.py
+++ b/tests/unit/test_interactions.py
@@ -1,5 +1,3 @@
-from pathlib import Path
-
 import numpy as np
 import pandas as pd

diff --git a/tests/unit/test_summary.py b/tests/unit/test_summary.py
index bf25b3cf8..f44de68b0 100644
--- a/tests/unit/test_summary.py
+++ b/tests/unit/test_summary.py
@@ -8,5 +8,3 @@ def test_get_table_stats_empty_df():
    table_stats = get_table_stats(df, {})
    assert table_stats["n"] == 0
    assert table_stats["p_cells_missing"] == 0
-    assert table_stats["n_duplicates"] == 0
-    assert table_stats["p_duplicates"] == 0

From 94acd76f751a239d10613f2616c6785cf4d1a521 Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Thu, 8 Apr 2021 13:29:13 +0200
Subject: [PATCH 32/40] build(deps): update pytest-benchmark requirement from ~=3.2.2 to ~=3.2.3 (#757)

Updates the requirements on [pytest-benchmark](https://github.com/ionelmc/pytest-benchmark) to permit the latest version.
- [Release notes](https://github.com/ionelmc/pytest-benchmark/releases)
- [Changelog](https://github.com/ionelmc/pytest-benchmark/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/ionelmc/pytest-benchmark/compare/v3.2.2...v3.2.3)
---
 requirements-test.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements-test.txt b/requirements-test.txt
index 137ae3666..e5bb6eb77 100644
--- a/requirements-test.txt
+++ b/requirements-test.txt
@@ -3,7 +3,7 @@ coverage<5
 codecov
 pytest-mypy
 pytest-cov
-pytest-benchmark~=3.2.2
+pytest-benchmark~=3.2.3
 nbval
 pyarrow
 flake8

From e91019fc82acfac96720398f326ace8bb2cfdd8a Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Mon, 12 Apr 2021 17:02:59 +0000
Subject: [PATCH 33/40] [pre-commit.ci] pre-commit autoupdate
---
 .pre-commit-config.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index ff93d66d7..f98639b6f 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -17,7 +17,7 @@ repos:
        additional_dependencies: [ pyupgrade==2.7.3 ]
        args: [ --nbqa-mutate, --py36-plus ]
  - repo: https://github.com/asottile/pyupgrade
-    rev: v2.11.0
+    rev: v2.12.0
    hooks:
      - id: pyupgrade
        args: ['--py36-plus','--exit-zero-even-if-changed']

From 7520bd73849ddbcf2c597b887204f08be85ff5cc Mon Sep 17 00:00:00 2001
From: sbrugman
Date: Fri, 16 Apr 2021 15:02:36 +0200
Subject: [PATCH 34/40] fix: banking example dataset's link dead, replaced with original source
---
 examples/bank_marketing_data/banking_data.py |  6 +--
 src/pandas_profiling/utils/cache.py          | 41 ++++++++++++++++++--
 tests/issues/test_issue377.py                |  6 +--
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/examples/bank_marketing_data/banking_data.py b/examples/bank_marketing_data/banking_data.py
index 9d5eb285c..139c5e964 100644
--- a/examples/bank_marketing_data/banking_data.py
+++ b/examples/bank_marketing_data/banking_data.py
@@ -5,12 +5,12 @@
 import pandas as pd

 from pandas_profiling import ProfileReport
-from pandas_profiling.utils.cache import cache_file
+from pandas_profiling.utils.cache import cache_zipped_file

 if __name__ == "__main__":
-    file_name = cache_file(
+    file_name = cache_zipped_file(
        "bank-full.csv",
-        "https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv",
+        "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip",
    )

    # Download the UCI Bank Marketing Dataset

diff --git a/src/pandas_profiling/utils/cache.py b/src/pandas_profiling/utils/cache.py
index 1699b2c22..07e67f04d 100644
--- a/src/pandas_profiling/utils/cache.py
+++ b/src/pandas_profiling/utils/cache.py
@@ -1,4 +1,5 @@
 """Dataset cache utility functions"""
+import zipfile
 from pathlib import Path

 import requests
@@ -20,9 +21,41 @@ def cache_file(file_name: str, url: str) -> Path:
    data_path = get_data_path()
    data_path.mkdir(exist_ok=True)

+    file_path = data_path / file_name
+
    # If not exists, download and create file
-    if not (data_path / file_name).exists():
-        data = requests.get(url)
-        (data_path / file_name).write_bytes(data.content)
+    if not file_path.exists():
+        response = requests.get(url)
+        file_path.write_bytes(response.content)
+
+    return file_path
+
+
+def cache_zipped_file(file_name: str, url: str) -> Path:
+    """Check if file_name already is in the data path, otherwise download it from url.
+
+    Args:
+        file_name: the file name
+        url: the URL of the dataset
+
+    Returns:
+        The relative path to the dataset
+    """
+
+    data_path = get_data_path()
+    data_path.mkdir(exist_ok=True)
+
+    file_path = data_path / file_name
+
+    # If not exists, download and create file
+    if not file_path.exists():
+        response = requests.get(url)
+        tmp_path = data_path / "tmp.zip"
+        tmp_path.write_bytes(response.content)
+
+        with zipfile.ZipFile(tmp_path, "r") as zip_file:
+            zip_file.extract(file_path.name, data_path)
+
+        tmp_path.unlink()

-    return data_path / file_name
+    return file_path

diff --git a/tests/issues/test_issue377.py b/tests/issues/test_issue377.py
index 2ffa39a92..1e03e6efd 100644
--- a/tests/issues/test_issue377.py
+++ b/tests/issues/test_issue377.py
@@ -8,14 +8,14 @@
 import pytest

 import pandas_profiling
-from pandas_profiling.utils.cache import cache_file
+from pandas_profiling.utils.cache import cache_zipped_file


 @pytest.mark.skipif(sys.version_info < (3, 6), reason="requires python3.6 or higher")
 def test_issue377():
-    file_name = cache_file(
+    file_name = cache_zipped_file(
        "bank-full.csv",
-        "https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv",
+        "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip",
    )

    # Download the UCI Bank Marketing Dataset

From 01bba41db88dc80dbc2fe83524793c18dcabbfcf Mon Sep 17 00:00:00 2001
From: sbrugman
Date: Fri, 16 Apr 2021 16:17:43 +0200
Subject: [PATCH 35/40] ci: commitlint conventional commits
---
 .github/workflows/commit.yml | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 .github/workflows/commit.yml

diff --git a/.github/workflows/commit.yml b/.github/workflows/commit.yml
new file mode 100644
index 000000000..818987e0f
--- /dev/null
+++ b/.github/workflows/commit.yml
@@ -0,0 +1,11 @@
+name: Lint Commit Messages
+on: [pull_request]
+
+jobs:
+  commitlint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - uses: wagoid/commitlint-github-action@v3
\ No newline at end of file

From 7592a082d403ab0df37ee4c2f95bd6a6623a08cb Mon Sep 17 00:00:00 2001
From: sbrugman
Date: Wed, 7 Apr 2021 20:43:07 +0200
Subject: [PATCH 36/40] feat: add RDW example
---
 README.md           |  1 +
 examples/rdw/rdw.py | 14 ++++++++++++++
 2 files changed, 15 insertions(+)
 create mode 100644 examples/rdw/rdw.py

diff --git a/README.md b/README.md
index dfc8578ee..de9e02497 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ The following examples can give you an impression of what the package can do:
 * [Vektis](https://pandas-profiling.github.io/pandas-profiling/examples/master/vektis/vektis_report.html) (Vektis Dutch Healthcare data)
 * [Colors](https://pandas-profiling.github.io/pandas-profiling/examples/master/colors/colors_report.html) (a simple colors dataset)
 * [UCI Bank Dataset](https://pandas-profiling.github.io/pandas-profiling/examples/master/cbank_marketing_data/uci_bank_marketing_report.html) (banking marketing dataset)
+* [RDW](https://pandas-profiling.github.io/pandas-profiling/examples/master/rdw/rdw.html) (RDW, the Dutch DMV's vehicle registration 10 million rows, 71 features)

 Specific features:

diff --git a/examples/rdw/rdw.py b/examples/rdw/rdw.py
new file mode 100644
index 000000000..3c500882c
--- /dev/null
+++ b/examples/rdw/rdw.py
@@ -0,0 +1,14 @@
+import pandas as pd
+
+from pandas_profiling import ProfileReport
+from pandas_profiling.utils.cache import cache_file
+
+if __name__ == "__main__":
+    file_name = cache_file(
+        "rdw.parquet",
+        "https://raw.githubusercontent.com/pandas-profiling/pandas-profiling-data/master/data/rdw.parquet",
+    )
+    data = pd.read_parquet(file_name)
+
+    profile = ProfileReport(data, title="RDW Dataset", minimal=True)
+    profile.to_file("rdw.html")

From 4d676361e7b164e1d192ed5ffb87223ec3680296 Mon Sep 17 00:00:00 2001
From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com>
Date: Mon, 19 Apr 2021 09:00:07 +0200
Subject: [PATCH 37/40] build(deps): update pytest-benchmark requirement from ~=3.2.3 to ~=3.4.1 (#764)

Updates the requirements on [pytest-benchmark](https://github.com/ionelmc/pytest-benchmark) to permit the latest version.
- [Release notes](https://github.com/ionelmc/pytest-benchmark/releases)
- [Changelog](https://github.com/ionelmc/pytest-benchmark/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/ionelmc/pytest-benchmark/compare/v3.2.3...v3.4.1)
---
 requirements-test.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements-test.txt b/requirements-test.txt
index e5bb6eb77..e92c82343 100644
--- a/requirements-test.txt
+++ b/requirements-test.txt
@@ -3,7 +3,7 @@ coverage<5
 codecov
 pytest-mypy
 pytest-cov
-pytest-benchmark~=3.2.3
+pytest-benchmark~=3.4.1
 nbval
 pyarrow
 flake8

From b91771d7995c349430ea4b115477ed3707eae49a Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Mon, 19 Apr 2021 20:56:58 +0200
Subject: [PATCH 38/40] build: pre-commit autoupdate (#765)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

build: pre-commit autoupdate (#765)

- [github.com/nbQA-dev/nbQA: 0.6.0 → 0.7.0](https://github.com/nbQA-dev/nbQA/compare/0.6.0...0.7.0)
- [github.com/PyCQA/flake8: 3.9.0 → 3.9.1](https://github.com/PyCQA/flake8/compare/3.9.0...3.9.1)
---
 .pre-commit-config.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index f98639b6f..a81c863b9 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
      - id: black
        language_version: python3.8
  - repo: https://github.com/nbQA-dev/nbQA
-    rev: 0.6.0
+    rev: 0.7.0
    hooks:
      - id: nbqa-black
        additional_dependencies: [ black==20.8b1 ]
@@ -32,7 +32,7 @@ repos:
    hooks:
      - id: check-manifest
  - repo: https://github.com/PyCQA/flake8
-    rev: "3.9.0"
+    rev: "3.9.1"
    hooks:
      - id: flake8
        args: [ "--select=E9,F63,F7,F82"] #,T001

From ad765be82ba4c9338f3659480abbfd68e44918dd Mon Sep 17 00:00:00 2001
From: sbrugman
Date: Wed, 5 May 2021 16:51:07 +0200
Subject: [PATCH 39/40] test: skip test if dataset is unavailable

CI will not be blocked if the UCI ML repository is down.
---
 src/pandas_profiling/utils/cache.py |  3 +++
 tests/issues/test_issue377.py       | 30 +++++++++++++++++++----------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/pandas_profiling/utils/cache.py b/src/pandas_profiling/utils/cache.py
index 07e67f04d..356d6fea8 100644
--- a/src/pandas_profiling/utils/cache.py
+++ b/src/pandas_profiling/utils/cache.py
@@ -50,6 +50,9 @@ def cache_zipped_file(file_name: str, url: str) -> Path:
    # If not exists, download and create file
    if not file_path.exists():
        response = requests.get(url)
+        if response.status_code != 200:
+            raise FileNotFoundError("Could not download resource")
+
        tmp_path = data_path / "tmp.zip"
        tmp_path.write_bytes(response.content)

diff --git a/tests/issues/test_issue377.py b/tests/issues/test_issue377.py
index 1e03e6efd..3362e812e 100644
--- a/tests/issues/test_issue377.py
+++ b/tests/issues/test_issue377.py
@@ -6,25 +6,35 @@
 import pandas as pd
 import pytest
+import requests

-import pandas_profiling
+from pandas_profiling import ProfileReport
 from pandas_profiling.utils.cache import cache_zipped_file


-@pytest.mark.skipif(sys.version_info < (3, 6), reason="requires python3.6 or higher")
-def test_issue377():
-    file_name = cache_zipped_file(
-        "bank-full.csv",
-        "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip",
-    )
+@pytest.fixture()
+def df():
+    try:
+        file_name = cache_zipped_file(
+            "bank-full.csv",
+            "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip",
+        )
+    except (requests.exceptions.ConnectionError, FileNotFoundError):
+        return

    # Download the UCI Bank Marketing Dataset
    df = pd.read_csv(file_name, sep=";")
+    return df
+
+
+@pytest.mark.skipif(sys.version_info < (3, 6), reason="requires python3.6 or higher")
+def test_issue377(df):
+    if df is None:
+        pytest.skip("dataset unavailable")
+        return

    original_order = tuple(df.columns.values)

-    profile = pandas_profiling.ProfileReport(
-        df, sort="None", pool_size=1, progress_bar=False
-    )
+    profile = ProfileReport(df, sort="None", pool_size=1, progress_bar=False)

    new_order = tuple(profile.get_description()["variables"].keys())
    assert original_order == new_order

From 1d4c9b58b132f8cd56fae0d5f57635bf675e86ca Mon Sep 17 00:00:00 2001
From: sbrugman
Date: Wed, 5 May 2021 17:50:56 +0200
Subject: [PATCH 40/40] chore: update changelog
---
 docsrc/source/pages/changelog/v2_12_0.rst | 19 ++++++++++++++++---
 docsrc/source/pages/changelog/v2_13_0.rst |  2 +-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/docsrc/source/pages/changelog/v2_12_0.rst b/docsrc/source/pages/changelog/v2_12_0.rst
index 2b6b6a5e8..02d35bb4a 100644
--- a/docsrc/source/pages/changelog/v2_12_0.rst
+++ b/docsrc/source/pages/changelog/v2_12_0.rst
@@ -3,14 +3,27 @@ Changelog v2.12.0

 🎉 Features
 ^^^^^^^^^^^
-- Add the number and the percentage of negative values for numerical variables `[695] `- (contributed by @gverbock).
+- Add the number and the percentage of negative values for numerical variables `[695] `_ (contributed by @gverbock)
 - Enable setting of typeset/summarizer (contributed by @ieaves)
+- Allow empty data frames `[678] `_ (contributed by @spbail, @fwd2020-c)
+
+🐛 Bug fixes
+^^^^^^^^^^^^
+- Patch args for great_expectations datetime profiler `[727] `_ (contributed by @jstammers)
+- Negative exponent formatting `[723] `_ (reported by @rdpapworth)

 📖 Documentation
 ^^^^^^^^^^^^^^^^
 - Fix link syntax (contributed by @ChrisCarini)

+👷‍♂️ Internal Improvements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- Several performance improvements (minimal mode, duplicates, frequency table sorting)
+- Introduce ``pytest-benchmark`` in CI to monitor commit performance impact
+- Introduce ``commitlint`` in CI to start automating the changelog generation
+
 ⬆️ Dependencies
 ^^^^^^^^^^^^^^^^^^
-- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx).
-- Replaced the (testing only) `fastparquet` dependency with `pyarrow` (default pandas parquet engine, contributed by @kurosch).
\ No newline at end of file
+- The ``ipywidgets`` dependency was moved to the ``[notebook]`` extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx)
+- Replaced the (testing only) ``fastparquet`` dependency with ``pyarrow`` (default pandas parquet engine, contributed by @kurosch)
+- Upgrade ``phik``. This drops the hard dependency on numba (contributed by @akx)

diff --git a/docsrc/source/pages/changelog/v2_13_0.rst b/docsrc/source/pages/changelog/v2_13_0.rst
index 4dbf7b73a..d8b8eb1c3 100644
--- a/docsrc/source/pages/changelog/v2_13_0.rst
+++ b/docsrc/source/pages/changelog/v2_13_0.rst
@@ -3,7 +3,7 @@ Changelog v2.13.0

 🎉 Features
 ^^^^^^^^^^^
-- Allow empty data frames `[678] `_ "contributed by @spbail, @fwd2020-c"
+-

 🐛 Bug fixes
 ^^^^^^^^^^^^