Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: drop joblib dependency #1090

Merged
merged 1 commit into from
Oct 7, 2022
Merged

fix: drop joblib dependency #1090

merged 1 commit into from
Oct 7, 2022

Conversation

akx
Copy link
Contributor

@akx akx commented Oct 4, 2022

Joblib was only used for joblib.hash for dataframes, but there's hash_pandas_object for that.

This naturally changes the hashes generated (but so could have any internal change in joblib so far) and furnishes for that by adding a 2@ (version 2) prefix to the newly-generated hashes.

Tangentially refs #1056

@akx akx changed the base branch from master to develop October 4, 2022 05:44
@codecov-commenter
Copy link

Codecov Report

Base: 90.91% // Head: 90.92% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (b187f90) compared to base (2462119).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head b187f90 differs from pull request most recent head 0b115b4. Consider uploading reports for the commit 0b115b4 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1090   +/-   ##
========================================
  Coverage    90.91%   90.92%           
========================================
  Files          174      173    -1     
  Lines         4929     4934    +5     
========================================
+ Hits          4481     4486    +5     
  Misses         448      448           
Flag Coverage Δ
py3.8-ubuntu-latest-pandas 90.92% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pandas_profiling/utils/dataframe.py 80.48% <100.00%> (+1.00%) ⬆️
tests/unit/test_typeset_default.py 95.45% <0.00%> (-4.55%) ⬇️
tests/unit/test_utils.py 100.00% <0.00%> (ø)
tests/issues/test_issue523.py 100.00% <0.00%> (ø)
tests/issues/test_issue545.py 100.00% <0.00%> (ø)
...ndas_profiling/model/pandas/correlations_pandas.py 100.00% <0.00%> (ø)
src/pandas_profiling/utils/compat.py
src/pandas_profiling/model/typeset_relations.py 89.33% <0.00%> (+0.29%) ⬆️
src/pandas_profiling/model/correlations.py 89.65% <0.00%> (+1.13%) ⬆️
..._profiling/model/pandas/describe_numeric_pandas.py 98.38% <0.00%> (+1.46%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

# binary representation would be more efficient, but it's not
# necessarily portable across architectures. Using the human-readable
# string values should be good enough.
hash_values = "\n".join(hash_pandas_object(df).values.astype(str))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do recall that hash_pandas_object had a serious limitation: not serializable due to circular dependencies.

Is this something that remains?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't imagine so, since this is not about serialization but about generating a hash...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about hashing indeed and I'm aware of that, but it does matter if limitation introduced given we have some requests to enable the ProfileReport serialization and the hashing is stored in a property of the Class _df_hash, if that limitation remains it is something to be taken into consideration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right – the _df_hash field is computed by just calling this function:
https://github.com/ydataai/pandas-profiling/blob/7506bca4489649f317c9df0d6da60b514808c7df/src/pandas_profiling/profile_report.py#L190-L194
so the value will be a str, which is always naturally serializable. (I see there's some additional magic involving that property in serialize_report...)

However, I eyeball a separate bug in that if the .df for a report is mutated, the hash is not invalidated – thus in fact, the property probably shouldn't have a backing _df_hash field at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the value is serializable, the problem is the product of serialization that becomes not usable. Hence my question.

Nevertheless, I do agree with your suggestion. Let's remove the dependency from joblib and open a separate issue.

@fabclmnt fabclmnt self-requested a review October 5, 2022 15:17
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs ydataai#1056
@fabclmnt fabclmnt merged commit b891c40 into ydataai:develop Oct 7, 2022
vascoalramos pushed a commit that referenced this pull request Oct 20, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
portellaa pushed a commit that referenced this pull request Oct 20, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
vascoalramos pushed a commit that referenced this pull request Oct 20, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
vascoalramos pushed a commit that referenced this pull request Oct 21, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
vascoalramos pushed a commit that referenced this pull request Oct 21, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
chanedwin pushed a commit that referenced this pull request Oct 30, 2022
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.

Tangentially refs #1056
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants