Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: issue#915, Error for large integers in Series #1233

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

Sohaib90
Copy link
Contributor

@Sohaib90 Sohaib90 commented Jan 5, 2023

This patch fixes issue #915: Error for series with large integers.

The issue is caused when using the numpy.histogram function with large integers which causes unevenly spaced bin edges.
here:

https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L39

and here:

https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L52

This can cause the resulting histogram to be distorted or misleading, as the bin sizes may not be uniform.

To resolve this issue, I used the numpy.histogram_bin_edges function to compute the bin edges for the data before passing them to the numpy.histogram function. This function allows to specify the number of bins and the range of the data, and will compute the bin edges in a way that ensures they are evenly spaced. This fix does not raise an error as reported in the bug report and successfully generates a report. I have also included a test_issue915.py for testing the generation of the report.

@Sohaib90 Sohaib90 changed the title Issue#915: Fixed the Error for large integers in Series fix: Issue#915, Error for large integers in Series Jan 5, 2023
@Sohaib90 Sohaib90 changed the base branch from master to develop January 20, 2023 13:54
@Sohaib90
Copy link
Contributor Author

@alexbarros you might want to look into this as well if you get the chance :)

@Sohaib90 Sohaib90 changed the title fix: Issue#915, Error for large integers in Series fix: issue#915, Error for large integers in Series Jan 24, 2023
@aquemy aquemy requested review from aquemy and alexbarros January 24, 2023 13:38
Comment on lines +39 to +40
bins = np.histogram_bin_edges(finite_values, bins=bins_arg)
stats[name] = np.histogram(finite_values, bins=bins, weights=weights)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting solution, but still seems to behave a bit weirdly with big numbers

> import numpy as np
> arr = np.array([716277643516076032 + i for i in range(100)])
> bins = np.histogram_bin_edges(arr, bins=5)
> np.histogram(arr, bins=bins)
(array([ 0,  0, 65,  0, 35]),
 array([7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17,
        7.16277644e+17, 7.16277644e+17]))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it is still not evenly distributed like it should be for smaller numbers. What do you propose here? Leaving np.histogram_bin_edges raises an error for larger numbers. Is it better to raise error than have weird behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking from a user's perspective, for me is better to have an error being raised than an incorrect plot. If I know that there was a problem with the large integers I can preprocess that column and run again, but an incorrect result may lead me to an incorrect interpretation of my data distribution.

Copy link
Contributor Author

@Sohaib90 Sohaib90 Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is what I was thinking as well. I think I should make the changes so that it leads to raising an error rather than making an incorrect plot, right?

Also there is a check Codacy Static Code Analysis that is failing. I think that is a new one

@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2023

Codecov Report

Base: 90.47% // Head: 90.49% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (01c2677) compared to base (d685678).
Patch coverage: 91.30% of modified lines in pull request are covered.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1233      +/-   ##
===========================================
+ Coverage    90.47%   90.49%   +0.01%     
===========================================
  Files          185      186       +1     
  Lines         5661     5681      +20     
===========================================
+ Hits          5122     5141      +19     
- Misses         539      540       +1     
Flag Coverage Δ
py3.8-ubuntu-latest-pandas 90.49% <91.30%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pandas_profiling/model/summary_algorithms.py 74.41% <66.66%> (-0.29%) ⬇️
tests/issues/test_issue915.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@fabclmnt fabclmnt requested a review from alexbarros January 25, 2023 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants