Commit

SPARKNLP-1094 Adding Support to Read Word Files (#14476)
* [SPARKNLP-1094] Adding support to read Word files

* [SPARKNLP-1094] Adding test doc examples

* [SPARKNLP-1094] Adding notebook example for Word files

* [SPARKNLP-1094] Adding latest jakarta mail dependencies

* [SPARKNLP-1094] Updating notebooks documentation

* [SPARKNLP-1094] Updating Databricks documentation
danilojsl authored Dec 15, 2024
1 parent 48c61bb commit acc9369
Showing 17 changed files with 1,039 additions and 334 deletions.
7 changes: 6 additions & 1 deletion build.sbt
@@ -158,7 +158,12 @@ lazy val utilDependencies = Seq(
 azureIdentity,
 azureStorage,
 jsoup,
-jakartaMail
+jakartaMail,
+angusMail,
+poiDocx
+exclude ("org.apache.logging.log4j", "log4j-api"),
+scratchpad
+exclude ("org.apache.logging.log4j", "log4j-api")
 )

 lazy val typedDependencyParserDependencies = Seq(junit)
10 changes: 10 additions & 0 deletions docs/en/advanced_settings.md
@@ -96,6 +96,16 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.

#### Additional Configuration for Databricks
When running the Email Reader feature (`sparknlp.read().email("./email-files")`) on Databricks, add the following Spark configurations to avoid dependency conflicts:

```bash
spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true
```
These configurations are required because the Databricks runtime bundles a version of the `com.sun.mail:jakarta.mail` library that conflicts with `jakarta.activation`.
With these properties set, user-provided libraries take precedence over those bundled in the Databricks environment, which resolves the dependency conflict.
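
As a quick sanity check, the two properties above can be verified before calling the reader. The following is a minimal sketch, not part of Spark NLP: the helper name `missing_classpath_settings` and the `REQUIRED_CONF` mapping are hypothetical, and the helper only inspects a plain dictionary of configuration values.

```python
# Hypothetical helper (not a Spark NLP API): reports which of the two
# properties described above are absent or not set to "true".
REQUIRED_CONF = {
    "spark.driver.userClassPathFirst": "true",
    "spark.executor.userClassPathFirst": "true",
}


def missing_classpath_settings(conf):
    """Return the required properties that are missing or not 'true'."""
    return {
        key: expected
        for key, expected in REQUIRED_CONF.items()
        if str(conf.get(key, "")).strip().lower() != expected
    }


# On a live cluster the mapping could be built from the active session
# (assuming a SparkSession named `spark`), e.g.:
#   conf = dict(spark.sparkContext.getConf().getAll())
example_conf = {"spark.driver.userClassPathFirst": "true"}
print(missing_classpath_settings(example_conf))
# prints {'spark.executor.userClassPathFirst': 'true'}
```

An empty result means both properties are present. Any property still reported as missing would need to be added to the cluster's Spark config, followed by a restart as noted above.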

</div><div class="h3-box" markdown="1">

### S3 Integration