text analysis using pandas_profiling #278
Comments
Hey @shahanesanket, great idea. I have a library under way (see https://bit.ly/better-nlp-launch) and would love to have these features embedded in it. We could then apply them to pandas-profiling or any other library. Let me know what you think of the idea, and whether you would like to collaborate on it with me and others.
Hi @neomatrix369, I would love to contribute.
How about you take a peek at the library and at the notebooks/kernels I have published, and then give me a shout if you need any help or have questions. Otherwise, I'll be happy to receive any PR from you. You can also start a discussion on a topic related to the above, and we can split the work between the two of us. The only way to get started is to start with it!
@neomatrix369 @shahanesanket This discussion is out of scope for this repository; please continue it somewhere else (for example, at the repository manu suggested above). A key design decision in the pandas-profiling package is that analyses should be objective, so that they are useful for a broad audience. This means that analyses relying on non-transparent machine learning models are not considered for data profiling. That being said, we have developed [...] Note that you can always compute model-specific predictions, add them to your DataFrame, and analyse those.
Sorry about that, @sbrugman. The intent of my points was to produce something that would be useful in general and that could also be incorporated into the pandas-profiling library, so it's a win-win for both sides. I have yet to understand what you mean in the rest of your comment above, but I trust you know what you are talking about, and I'm happy to wait and see it in the pandas-profiling library.
Commit for pandas-profiling v2.5.0
- Progress bar added (#224)
- Character analysis for Text/NLP (#278)
- Themes: configuration and demos (Orange, Dark)
- Tutorial on modifying the report's structure (#362; #281, #259, #253, #234). This Jupyter notebook also demonstrates how to use the Kaggle API together with pandas-profiling.
- Toggle descriptions at correlations.
Deprecation:
- This is the last version to support Python 3.5.
Stability:
- The order of columns changed when sort="None" (#377, fixed).
- Pandas v1.0.X is not yet supported (#367, #366, #363, #353, pinned pandas to < 1)
- Improved mixed type detection (#351)
- Refactor of report structures.
- Correlations are more stable (e.g. Phi_k color scale now from 0-1, rows and columns with NaN values are dropped, #329).
- Distinct counts exclude NaNs.
- Fixed alerts in notebooks.
Other improvements:
- Warnings are now sorted.
- Links to Binder and Google Colab are added for notebooks (#349)
- The overview section is tabbed.
In response to this issue, I started working on a basic NLP profiler project.
It's still early days, and I hope that I (or someone else) can integrate it with or into pandas-profiling. So far the response has been pretty good; many are recognising its potential and purpose. I'm happy to invite you to continue discussing this on neomatrix369/awesome-ai-ml-dl#45
Is your feature request related to a problem? Please describe.
I would like to analyze text fields the same way numeric and categorical fields are analyzed and reported. In particular, before working on any NLP problem, it would be very helpful and time-saving to have this analysis done in one line of code.
To start with I would like to see:
2.1 min, max, average, quantiles
2.2 frequent words, infrequent words (this could use the deepmoji project's tokenizer, which is very robust)
2.3 word cloud (if that isn't too much of a stretch)
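For illustration, the word-level statistics requested above could be sketched roughly as follows. This is only a minimal standard-library sketch, not how pandas-profiling implements anything: `profile_text` and its `top_n` parameter are hypothetical names, tokenization is a naive whitespace split standing in for a real tokenizer such as deepmoji's, and the word cloud is left out.

```python
from collections import Counter
from statistics import mean, quantiles


def profile_text(values, top_n=3):
    """Summarise a text column: word-count statistics and word frequencies.

    Hypothetical helper for illustration. `values` is any iterable of
    strings, for example a pandas Series passed as df["text"].dropna().
    """
    # Assumption: lower-casing plus whitespace splitting is "good enough"
    # tokenization for a first look at the data.
    docs = [str(v).lower().split() for v in values]
    lengths = [len(words) for words in docs]
    counts = Counter(word for words in docs for word in words)

    return {
        "min_words": min(lengths),
        "max_words": max(lengths),
        "mean_words": mean(lengths),
        # Quartile cut points of the per-row word counts (needs >= 2 rows).
        "quartiles": quantiles(lengths, n=4),
        # The top_n most common words with their counts.
        "frequent": counts.most_common(top_n),
        # Words that occur exactly once in the whole column.
        "infrequent": sorted(w for w, c in counts.items() if c == 1),
    }


sample = [
    "the cat sat",
    "the dog sat down",
    "a bird",
]
summary = profile_text(sample)
print(summary)
```

The same function could be applied per text column of a DataFrame, and the resulting dictionaries collected into a small summary table.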
Currently, I rely heavily on pandas_profiling, and the only alternative I have is doing this text analysis manually. I would like to contribute if this is something the maintainers would consider building into the project.