[SPARKNLP-1103] Adding support to read power point files #14491

danilojsl · 2024-12-24T20:23:07Z

Description

This pull request adds a reader that parses Microsoft PowerPoint files (.pptx and .ppt) into a structured Spark DataFrame. Slide content can then be processed and analyzed in Spark and passed to Spark NLP for downstream natural language processing tasks.
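To make "slides into structured rows" concrete, here is a minimal sketch that uses only the Python standard library. It is not the reader added by this PR and does not use the Spark NLP API. It assumes the .pptx layout (a ZIP archive with slide XML under ppt/slides/) and does not handle legacy binary .ppt files. The names `slide_rows` and `build_demo_pptx` and the `slide`/`content` keys are hypothetical, chosen for illustration only.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace that holds paragraphs (a:p) and text runs (a:t).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Slide parts inside a .pptx archive; the relationship files do not match.
SLIDE_PATH = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_rows(pptx_bytes):
    """Return one dict per non-empty text paragraph in a .pptx archive.

    Assumption: slides are ordered by the number in their file name,
    which may differ from the presentation order in a real deck.
    """
    rows = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_PATH.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            for paragraph in root.iter(A_NS + "p"):
                # A paragraph may be split into several runs; join them.
                text = "".join(run.text or "" for run in paragraph.iter(A_NS + "t"))
                if text.strip():
                    rows.append({"slide": number, "content": text})
    return rows


def build_demo_pptx():
    """Build a tiny in-memory archive with one slide (hypothetical test data)."""
    slide = (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        "<p:cSld><p:spTree><p:sp><p:txBody>"
        "<a:p><a:r><a:t>Hello </a:t></a:r><a:r><a:t>slides</a:t></a:r></a:p>"
        "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ppt/slides/slide1.xml", slide)
    return buffer.getvalue()


if __name__ == "__main__":
    print(slide_rows(build_demo_pptx()))
```

Each returned dict corresponds to one row of a table with a slide number and a text column, which is the kind of shape a DataFrame reader would produce. The actual column names and metadata emitted by the reader in this PR are not shown here.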

Important: Please do not merge this PR until #14489 is merged.

Motivation and Context

Structured Data Representation: Transforming PowerPoint slide content into a well-defined DataFrame structure lets it plug directly into Spark's analytical and data processing capabilities.
Scalability: Because the reader runs on Spark's distributed architecture, it can process large volumes of PowerPoint slide data, which is critical for big data projects.
Simplified Data Manipulation: A structured DataFrame simplifies tasks such as filtering, aggregating, and transforming PowerPoint slide content, reducing complexity and improving productivity.
Enhanced Context for LLM Tasks: By organizing PowerPoint slides into structured formats, we can curate and provide more specific, context-rich content for large language model (LLM) prompts. More targeted and relevant context improves the quality of LLM-generated outputs, which is essential for applications like NLU and content generation.

How Has This Been Tested?

Local Tests
Google Colab notebook
Databricks notebooks

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

danilojsl added 3 commits December 17, 2024 18:58

[SPARKNLP-1102] Adding support to read Excel files

4657490

[SPARKNLP-1102] Adding notebook example to read Excel files

04a22dd

[SPARKNLP-1103] Adding support to read PowerPoint files and adds location metadata for tables to all readers

f2a9ed9

danilojsl added the DON'T MERGE (Do not merge this PR) and new-feature (Introducing a new feature) labels Dec 24, 2024

danilojsl requested a review from maziyarpanahi December 24, 2024 20:23

[SPARKNLP-1103] Adding documentation and notebook example for PowerPoint reader

175af0f
