fix: drop joblib dependency #1090
Codecov Report
Base: 90.91% // Head: 90.92% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## develop #1090 +/- ##
========================================
Coverage 90.91% 90.92%
========================================
Files 174 173 -1
Lines 4929 4934 +5
========================================
+ Hits 4481 4486 +5
Misses 448 448
# binary representation would be more efficient, but it's not
# necessarily portable across architectures. Using the human-readable
# string values should be good enough.
hash_values = "\n".join(hash_pandas_object(df).values.astype(str))
I do recall that `hash_pandas_object` had a serious limitation: not serializable due to circular dependencies.
Is this something that remains?
I wouldn't imagine so, since this is not about serialization but about generating a hash...
It is indeed about hashing, and I'm aware of that, but it matters whether that limitation is introduced: we have some requests to enable `ProfileReport` serialization, and the hash is stored in a property of the class, `_df_hash`. If that limitation remains, it is something to be taken into consideration.
Right – the `_df_hash` field is computed by just calling this function:
https://github.com/ydataai/pandas-profiling/blob/7506bca4489649f317c9df0d6da60b514808c7df/src/pandas_profiling/profile_report.py#L190-L194
so the value will be a `str`, which is always naturally serializable. (I see there's some additional magic involving that property in `serialize_report`...)
However, I spot a separate bug: if the `.df` for a report is mutated, the hash is not invalidated. So in fact, the property probably shouldn't have a backing `_df_hash` field at all.
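To illustrate the invalidation point, here is a minimal sketch of a property with no backing field. The `Report` class and the `df_hash` name are hypothetical stand-ins, not the actual `ProfileReport` code, and the hash is reduced to the joined string values for brevity:

```python
import pandas as pd
from pandas.util import hash_pandas_object


class Report:
    """Hypothetical stand-in for ProfileReport, for illustration only."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    @property
    def df_hash(self) -> str:
        # Recomputed on every access, so mutating self.df can never
        # leave a stale cached value behind.
        return "\n".join(hash_pandas_object(self.df).values.astype(str))
```

The trade-off is that the hash is recomputed on each access, which may be noticeable for large dataframes.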
Yes, the value is serializable; the problem is that the product of serialization becomes unusable. Hence my question.
Nevertheless, I do agree with your suggestion. Let's remove the joblib dependency and open a separate issue.
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that. Tangentially refs ydataai#1056
Joblib was only used for `joblib.hash` for dataframes, but there's `hash_pandas_object` for that.
This naturally changes the hashes generated (but so could any internal change in joblib so far), and the PR accounts for that by adding a `2@` (version 2) prefix to the newly generated hashes.
Tangentially refs #1056
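
For readers skimming the thread, a rough sketch of the versioned hashing described above. Only the `hash_values` line comes from the diff under review; the function name `hash_dataframe` and the use of SHA-1 to condense the row hashes are assumptions made for illustration and may not match the merged code:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def hash_dataframe(df: pd.DataFrame) -> str:
    """Hypothetical helper: return a versioned, string-valued hash of df."""
    # Line taken from the diff: per-row hashes as human-readable strings.
    hash_values = "\n".join(hash_pandas_object(df).values.astype(str))
    # Assumption: condense the row hashes with SHA-1 into one digest.
    digest = hashlib.sha1(hash_values.encode("utf-8")).hexdigest()
    # The "2@" prefix marks the new (version 2) hash format.
    return f"2@{digest}"
```

Because the result is a plain `str`, it can be stored or serialized without involving pandas objects.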