text analysis using pandas_profiling #278

Closed
shahanesanket opened this issue Oct 23, 2019 · 6 comments
Labels
feature request 💬 Requests for new features

Comments

@shahanesanket

Is your feature request related to a problem? Please describe.
I would like to analyze text fields the same way numeric and categorical fields are analyzed and reported. In particular, before working on any NLP problem, it would be very helpful and time-saving to have this analysis done in one line of code.

To start with I would like to see:

  1. Missing value analysis
  2. Text length analysis
    2.1 min, max, average, quantiles
    2.2 frequent words, infrequent words (could use the deepmoji project's tokenizer, which is very robust)
    2.3 word cloud (if that isn't too far-fetched a goal)
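
As a rough illustration of the statistics listed above, here is a minimal sketch using plain pandas. The function name `profile_text_column`, the sample data, and the whitespace tokenisation are assumptions made for this example; it is not pandas-profiling's implementation and does not use the deepmoji tokenizer.

```python
from collections import Counter

import pandas as pd


def profile_text_column(series: pd.Series, top_n: int = 3) -> dict:
    """Summarise a text column: missing values, length statistics, word frequencies."""
    n_missing = int(series.isna().sum())
    text = series.dropna().astype(str)

    # Character length of each non-missing value, then descriptive statistics.
    lengths = text.str.len()
    length_stats = {
        "min": int(lengths.min()),
        "max": int(lengths.max()),
        "mean": float(lengths.mean()),
        "quantiles": lengths.quantile([0.25, 0.5, 0.75]).to_dict(),
    }

    # Naive tokenisation (assumption): lowercase and split on whitespace.
    counts = Counter(word for doc in text for word in doc.lower().split())

    return {
        "missing": n_missing,
        "length": length_stats,
        "frequent_words": counts.most_common(top_n),
        # "Infrequent" is taken here to mean words that occur exactly once.
        "infrequent_words": sorted(w for w, c in counts.items() if c == 1),
    }


reviews = pd.Series(["good food", "good service", None, "bad food but good wine"])
summary = profile_text_column(reviews)
print(summary)
```

A real tokenizer would handle punctuation, emoji, and casing more carefully; the split on whitespace is only there to keep the sketch short.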

Currently, I rely heavily on pandas_profiling, and the only alternative I have is to do this text analysis manually. I would like to contribute if this is something the maintainers would consider building into the project.

@shahanesanket shahanesanket added the feature request 💬 Requests for new features label Oct 23, 2019
@neomatrix369

Hey @shahanesanket, great idea. I have a library under way (see https://bit.ly/better-nlp-launch) and would love to have these features built into it. We could then apply them in pandas-profiling or any other library. Let me know what you think, and whether you would like to collaborate on this with me and others.

@shahanesanket
Author

Hi @neomatrix369 would love to contribute.

@neomatrix369

How about you take a peek at the library and also the notebooks/kernels I have published, and then give me a shout if you need any help or have questions.

Otherwise, I'll be happy to receive any PR from you. You can also start a discussion on a topic related to the above and we can split the work between the two of us.

The only way to get started is to start with it!

@sbrugman
Collaborator

sbrugman commented Nov 1, 2019

@neomatrix369 @shahanesanket This discussion is out of scope of this repository, please continue it somewhere else (for example at the repository manu suggested above).

A key design decision in the pandas-profiling package is that analyses should be objective, so that they are useful for a broad audience. This means that analyses relying on non-transparent machine learning models are not considered for data profiling.

That being said, we have developed tangled-up-in-unicode to perform objective analysis based on the Unicode Character Database.
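
To give a feel for what an objective, character-level analysis can look like, here is a minimal sketch that counts characters by Unicode general category. It uses Python's standard `unicodedata` module as a stand-in; it does not reproduce the tangled-up-in-unicode API, and the helper name `unicode_category_counts` is made up for this example.

```python
import unicodedata
from collections import Counter


def unicode_category_counts(text: str) -> Counter:
    """Count characters by Unicode general category (e.g. 'Lu', 'Ll', 'Nd')."""
    return Counter(unicodedata.category(ch) for ch in text)


counts = unicode_category_counts("Hello, World 42!")
# Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit,
# Po = other punctuation, Zs = space separator.
print(dict(counts))
```

The result depends only on the characters and the Unicode Character Database, not on a trained model, which is the kind of objectivity described above.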

Note that you can always use model-specific predictions and add them to your DataFrame, and analyse those.
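
A minimal sketch of that workflow, assuming a made-up scoring function `toy_sentiment` in place of a real model: the prediction is stored as an ordinary column, after which any profiler that handles numeric columns can summarise it. `DataFrame.describe()` is used here only as a lightweight stand-in for a full profiling report.

```python
import pandas as pd


def toy_sentiment(text: str) -> float:
    """Hypothetical stand-in for a real model: fraction of words equal to 'good'."""
    words = text.lower().split()
    return sum(w == "good" for w in words) / len(words) if words else 0.0


df = pd.DataFrame({"review": ["good food", "bad service", "good good wine"]})

# Store the model output as a regular numeric column next to the raw text.
df["sentiment"] = df["review"].map(toy_sentiment)

# The new column can now be analysed like any other numeric variable.
stats = df["sentiment"].describe()
print(stats)
```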

@neomatrix369

neomatrix369 commented Nov 1, 2019

Sorry about that, @sbrugman. My intent was to produce something that would be useful in general and that could also be incorporated into the pandas-profiling library, so that it is a win-win for both sides.

I have yet to understand the rest of your comment above, but I trust you know what you are talking about, and I'm happy to wait and see the above in the pandas-profiling library.

@sbrugman sbrugman added this to the next release milestone Jan 23, 2020
sbrugman added a commit that referenced this issue Feb 14, 2020
- Progress bar added (#224)
- Character analysis for Text/NLP (#278)
- Themes: configuration and demos (Orange, Dark)
- Tutorial on modifying the report's structure (#362; #281, #259, #253, #234). This Jupyter notebook also demonstrates how to use the Kaggle API together with pandas-profiling.
- Toggle descriptions at correlations.

Deprecation:

- This is the last version to support Python 3.5.

Stability:

- The order of columns changed when sort="None" (#377, fixed).
- Pandas v1.0.X is not yet supported (#367, #366, #363, #353, pinned pandas to < 1)
- Improved mixed type detection (#351)
- Refactor of report structures.
- Correlations are more stable (e.g. Phi_k color scale now from 0-1, rows and columns with NaN values are dropped, #329).
- Distinct counts exclude NaNs.
- Fixed alerts in notebooks.

Other improvements:

- Warnings are now sorted.
- Links to Binder and Google Colab are added for notebooks (#349)
- The overview section is tabbed.

* Commit for pandas-profiling v2.5.0
@neomatrix369

neomatrix369 commented Aug 4, 2020

As a response to this issue I started working on a basic NLP profiler project:

It's still early days, and hopefully I (or someone else) will be able to integrate it with or into pandas-profiling. So far the response has been pretty good; many are recognising its potential and purpose.

I'm happy to invite you to continue discussing this on neomatrix369/awesome-ai-ml-dl#45

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020