text analysis using pandas_profiling #278

shahanesanket · 2019-10-23T19:17:48Z

Is your feature request related to a problem? Please describe.
I would like to analyze text fields the same way numeric and categorical fields are analyzed and reported. In particular, before working on any NLP problem, it would be very helpful and time-saving to have this analysis done in one line of code.

To start with I would like to see:

1. Missing value analysis
2. Text length analysis
2.1 min, max, average, quantiles
2.2 frequent words, infrequent words (could use the deepmoji project's tokenizer; it's very robust)
2.3 word cloud (if it isn't too far-fetched a goal)
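For reference, the manual version of the analyses listed above might look roughly like the sketch below. It uses plain pandas, not any pandas_profiling feature. The sample data, the column name `review`, and the whitespace tokenizer are all assumptions made for illustration; the tokenizer is a naive stand-in, not the deepmoji one.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data; the column name "review" is an assumption.
df = pd.DataFrame(
    {"review": ["great product", None, "bad", "great value great price"]}
)
text = df["review"]

# 1. Missing value analysis: count entries with no text at all.
n_missing = int(text.isna().sum())

# 2.1 Text length analysis: characters per non-missing entry,
# summarised as min, max, mean and quartiles.
lengths = text.dropna().str.len()
length_stats = lengths.describe(percentiles=[0.25, 0.5, 0.75])

# 2.2 Frequent and infrequent words, using a naive lowercase
# whitespace split (a stand-in for a proper tokenizer).
word_counts = Counter(
    word for doc in text.dropna() for word in doc.lower().split()
)
frequent = word_counts.most_common(1)
infrequent = sorted(w for w, c in word_counts.items() if c == 1)

print("missing:", n_missing)
print(length_stats)
print("most frequent:", frequent)
print("seen once:", infrequent)
```

A word cloud is left out here because it would need an extra plotting dependency.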

Currently, I rely heavily on pandas_profiling, and the only alternative I have is doing this text analysis manually. I would like to contribute if this is something the maintainers would consider building into the project.

neomatrix369 · 2019-10-24T22:05:17Z

Hey @shahanesanket, great idea. I have a library underway (see https://bit.ly/better-nlp-launch) and would love to have these features embedded in it. We could then apply them to pandas-profiling or any other library. Let me know what you think of the idea, and whether you would like to collaborate on it with me and others.

shahanesanket · 2019-10-30T15:38:48Z

Hi @neomatrix369 would love to contribute.

neomatrix369 · 2019-10-31T23:16:07Z

How about you take a peek at the library and also the notebooks/kernels I have published, and then give me a shout if you need any help or have questions.

Otherwise, I'll be happy to receive any PR from you. You can also start a discussion on a topic related to the above and we can split the work between the two of us.

The only way to get started is to start with it!

sbrugman · 2019-11-01T11:56:19Z

@neomatrix369 @shahanesanket This discussion is out of scope for this repository; please continue it elsewhere (for example, at the repository manu suggested above).

A key design decision in the pandas-profiling package is that analyses should be objective, so that they are useful for a broad audience. This means that analyses relying on non-transparent machine learning models are not considered for data profiling.

That being said, we have developed tangled-up-in-unicode to perform objective analysis based on the Unicode Character Database.
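To give a sense of what an objective, character-level analysis can look like, here is a minimal sketch that counts characters by Unicode general category. It uses Python's standard `unicodedata` module rather than tangled-up-in-unicode itself, and the helper name `category_counts` is made up for this example.

```python
import unicodedata
from collections import Counter


def category_counts(text: str) -> Counter:
    """Count characters by Unicode general category (e.g. 'Lu', 'Ll', 'Nd')."""
    return Counter(unicodedata.category(ch) for ch in text)


counts = category_counts("Hello, Wörld 42!")

# Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit,
# Po = other punctuation, Zs = space separator.
for category, n in sorted(counts.items()):
    print(category, n)
```

The result depends only on the characters and the Unicode Character Database, not on a trained model.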

Note that you can always compute model-specific predictions, add them to your DataFrame, and analyse those.
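A rough sketch of that workflow, assuming a hypothetical `toy_sentiment` function in place of a real model and an invented `review` column:

```python
import pandas as pd


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a model's prediction; not a real model."""
    return "positive" if "great" in text.lower() else "negative"


# Invented sample data for illustration only.
df = pd.DataFrame(
    {"review": ["Great product", "Broke after a day", "great value"]}
)

# Store the predictions as an ordinary column, so they are profiled
# like any other categorical variable.
df["sentiment"] = df["review"].map(toy_sentiment)

print(df["sentiment"].value_counts())

# The augmented DataFrame can then be profiled as usual, for example:
# from pandas_profiling import ProfileReport
# ProfileReport(df).to_file("report.html")
```

The profiling step is left as a comment so the sketch runs with pandas alone.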

neomatrix369 · 2019-11-01T15:29:14Z

Sorry about that, @sbrugman. The intent of my points was to produce something that would be useful in general and that could also be incorporated into the pandas-profiling library, so it's a win-win for both sides.

I have yet to understand the rest of your comment above, but I trust you know what you are talking about, and I'm happy to wait and see this in the pandas-profiling library.

Commit for pandas-profiling v2.5.0

- Progress bar added (#224)
- Character analysis for Text/NLP (#278)
- Themes: configuration and demo's (Orange, Dark)
- Tutorial on modifying the report's structure (#362; #281, #259, #253, #234). This jupyter notebook also demonstrates how to use the Kaggle api together with pandas-profiling.
- Toggle descriptions at correlations.

Deprecation:
- This is the last version to support Python 3.5.

Stability:
- The order of columns changed when sort="None" (#377, fixed).
- Pandas v1.0.X is not yet supported (#367, #366, #363, #353, pinned pandas to < 1)
- Improved mixed type detection (#351)
- Refactor of report structures.
- Correlations are more stable (e.g. Phi_k color scale now from 0-1, rows and columns with NaN values are dropped, #329).
- Distinct counts exclude NaNs.
- Fixed alerts in notebooks.

Other improvements:
- Warnings are now sorted.
- Links to Binder and Google Colab are added for notebooks (#349)
- The overview section is tabbed.

neomatrix369 · 2020-08-04T06:25:01Z

As a response to this issue I started working on a basic NLP profiler project:

Kaggle kernel: https://www.kaggle.com/neomatrix369/nlp-profiler-simple-dataset
Utility script: https://www.kaggle.com/neomatrix369/nlp-profiler-class

It's still early days, but I hope that I (or someone else) can integrate it with or into pandas-profiling. So far the response has been pretty good; many are recognising its potential and purpose.

I'm happy to invite you to continue discussing this on neomatrix369/awesome-ai-ml-dl#45

shahanesanket added the "feature request 💬" label (Requests for new features) Oct 23, 2019

neomatrix369 mentioned this issue Nov 1, 2019: Add more features to the BetterNLP library neomatrix369/awesome-ai-ml-dl#45 (Open)

sbrugman added this to the next release milestone Jan 23, 2020

sbrugman mentioned this issue Feb 14, 2020: Commit for pandas-profiling v2.5.0 #380 (Merged)

sbrugman closed this as completed Feb 14, 2020
