Commit

faq fix (#13986)
agsfer authored Sep 15, 2023
1 parent dbb408f commit 5e2a023
Showing 7 changed files with 50 additions and 86 deletions.
3 changes: 2 additions & 1 deletion docs/_data/navigation.yml
@@ -10,8 +10,9 @@ header:
- title: Demo
url: /infer_meaning_intent
key: demo
- title: '<a href="https://www.johnsnowlabs.com/spark-nlp-blog/" target="_blank">Blog</a>'
- title: Blog
url: https://www.johnsnowlabs.com/spark-nlp-blog/
blank: yes
key: blog
# - title: '<span style="color: #FF8A00;"><i class = "fab fa-github fa-2x"></i></span>'
# url: https://github.com/JohnSnowLabs/spark-nlp
4 changes: 2 additions & 2 deletions docs/_includes/header.html
@@ -35,9 +35,9 @@
{%- assign _page_url = __return -%}
{%- include snippets/get-string-from-locale-config.html locale=_item.titles -%}
{%- if _nav_url == _page_url or page.nav_key and _item.key and page.nav_key == _item.key -%}
<li class="navigation__item navigation__item--active"><a href="{{ _nav_url }}">{%- if _item.title -%}{{ _item.title }}{%- else -%}{{ __return }}{%- endif -%}</a></li>
<li class="navigation__item navigation__item--active"><a {% if _item.blank %} target="_blank" {% endif %} href="{{ _nav_url }}">{%- if _item.title -%}{{ _item.title }}{%- else -%}{{ __return }}{%- endif -%}</a></li>
{%- else -%}
<li class="navigation__item {{ article_header.uniq_class }}"><a href="{{ _nav_url }}">{%- if _item.title -%}{{ _item.title }}{%- else -%}{{ __return }}{%- endif -%}</a></li>
<li class="navigation__item {{ article_header.uniq_class }}"><a {% if _item.blank %} target="_blank" {% endif %} href="{{ _nav_url }}">{%- if _item.title -%}{{ _item.title }}{%- else -%}{{ __return }}{%- endif -%}</a></li>
{%- endif -%}
{%- endfor -%}
{%- if site.search.provider -%}
71 changes: 24 additions & 47 deletions docs/en/CPUvsGPUbenchmark.md
@@ -1,11 +1,11 @@
---
layout: docs
header: true
seotitle: GPU vs CPU benchmark
title: GPU vs CPU benchmark
permalink: /docs/en/CPUvsGPUbenchmark
key: docs-concepts
modify_date: "2023-04-06"
use_language_switcher: "Python-Scala"
show_nav: true
sidebar:
nav: sparknlp
@@ -19,8 +19,7 @@ Different benchmarks, as well as their takeaways and some conclusions of how to

Each major release comes with big improvements, so please make sure you use at least that version to fully leverage Spark NLP capabilities on GPU.

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Machine specs

@@ -30,8 +29,7 @@ An AWS `m5.8xlarge` machine was used for the CPU benchmarking. This machine cons
#### GPU
A `Tesla V100 SXM2` GPU with `32GB` of memory was used to calculate the GPU benchmarking.

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Versions
The benchmarking was carried out with the following Spark NLP versions:
@@ -44,8 +42,7 @@ SparkNLP version: `3.3.4`

Spark nodes: 1

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Benchmark on classifierDLApproach()

@@ -56,8 +53,7 @@ We used the Spark NLP class `ClassifierDL` and its method `Approach()` as descr
The pipeline looks as follows:
![](/assets/images/gpu_v2_pic3.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Dataset
The size of the dataset was relatively small (200K), consisting of:
@@ -66,8 +62,7 @@ Training (rows): `162250`

Test (rows): `40301`

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Training params
Different batch sizes were tested to demonstrate how GPU performance improves with bigger batches compared to CPU, for a constant number of epochs and learning rate.
@@ -78,14 +73,12 @@ Learning rate: `0.003`

Batch sizes: `32`, `64`, `256`, `1024`

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Results
Even for this average-sized dataset, we can observe that the GPU machine beats the CPU machine by `76%` in both `training` and `inference` times.

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Training times depending on batch (in minutes)

@@ -100,8 +93,7 @@ Even for this average-sized dataset, we can observe that GPU is able to beat the
| 256 | 64 | 14.5 |
| 1024 | 64 | 14 |
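As a quick sanity check (not part of the original benchmark), the reported improvement can be recomputed from the visible batch-256 row of the table above — assuming the columns are batch size, CPU minutes, and GPU minutes:

```python
# Recompute the GPU-vs-CPU improvement from the visible batch-256 row:
# CPU 64 min vs GPU 14.5 min (column meaning assumed from context).
cpu_min, gpu_min = 64.0, 14.5

improvement = (cpu_min - gpu_min) / cpu_min * 100  # percent of training time saved
speedup = cpu_min / gpu_min                        # how many times faster

print(f"{improvement:.0f}% faster")  # 77% faster
print(f"{speedup:.1f}x speedup")     # 4.4x speedup
```

This lands within a point of the ~76% figure quoted above.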

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Inference times (in minutes)
The average inference time remained more or less constant regardless of the batch size:
@@ -110,16 +102,14 @@ GPU: `2 min`

![](/assets/images/gpu_v2_pic5.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Performance metrics
A weighted F1-score of 0.88 was achieved, with a 0.90 score for question detection and 0.83 for statements.

![](/assets/images/gpu_v2_pic2.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Benchmark on NerDLApproach()

@@ -130,8 +120,7 @@ We used the Spark NLP class `NerDL` and its method `Approach()` as described in
The pipeline looks as follows:
![](/assets/images/gpu_v2_pic4.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Dataset
The size of the dataset was small (17K), consisting of:
@@ -140,8 +129,7 @@ Training (rows): `14041`

Test (rows): `3250`

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Training params
Different batch sizes were tested to demonstrate how GPU performance improves with bigger batches compared to CPU, for a constant number of epochs and learning rate.
@@ -152,14 +140,12 @@ Learning rate: `0.003`

Batch sizes: `32`, `64`, `256`, `512`, `1024`, `2048`

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Results
Even for this small dataset, we can observe that the GPU machine beats the CPU machine by `62%` in `training` time and `68%` in `inference` time. It's important to mention that batch size is very relevant when using GPU, since CPU scales much worse with bigger batch sizes than GPU.

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Training times depending on batch (in minutes)

@@ -175,8 +161,7 @@ Even for this small dataset, we can observe that GPU is able to beat the CPU mac
| 1024 | 6.5 | 2.5 |
| 2048 | 6.5 | 2.5 |
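The same arithmetic (again only a sanity check, with the column meaning assumed) reproduces the 62% training-time figure from the visible rows, where both batch 1024 and batch 2048 show CPU 6.5 min vs GPU 2.5 min:

```python
# Visible rows: CPU 6.5 min vs GPU 2.5 min at batch sizes 1024 and 2048.
cpu_min, gpu_min = 6.5, 2.5
improvement = (cpu_min - gpu_min) / cpu_min * 100
print(f"{improvement:.0f}%")  # 62% — matching the figure quoted above
```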

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Inference times (in minutes)
Although CPU inference times remain more or less constant regardless of the batch size, GPU times improve considerably as the batch size grows.
@@ -195,16 +180,14 @@ CPU times: `~29 min`

![](/assets/images/gpu_v2_pic7.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Performance metrics
A macro F1-score of about `0.92` (`0.90` in micro) was achieved, with the following charts extracted from the `NerDLApproach()` logs:

![](/assets/images/gpu_v2_pic8.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Inference benchmark on BertSentenceEmbeddings()

@@ -215,22 +198,19 @@ We used the Spark NLP class `BertSentenceEmbeddings()` described in the Transfor
The pipeline contains only two components and looks as follows:
![](/assets/images/gpu_v2_pic9.png)

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Dataset
The size of the dataset was bigger than the previous ones, with `417735` rows for inference.

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Results
We have observed in previous experiments, using BertSentenceEmbeddings (ClassifierDL) and also BertEmbeddings (NerDL), how GPU improved both training and inference times. In this case, we again observe big improvements in inference, which suggests that one of the main reasons GPU outperforms CPU so strongly is its better handling of embeddings (at word and sentence level) and of bigger batch sizes.

Batch sizes: `32`, `64`, `256`, `1024`

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

#### Inference times depending on batch (in minutes)
![](/assets/images/gpu_v2_pic10.png)
@@ -243,8 +223,7 @@ Batch sizes: `32`, `64`, `256`, `1024`
| 256 | 63 | 9.4 |
| 1024 | 62 | 9.1 |
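From the two visible rows (columns assumed to be batch size, CPU minutes, GPU minutes), a per-batch speedup factor can be derived — a rough check on my part, not original benchmark output:

```python
# (batch size, CPU minutes, GPU minutes) taken from the visible table rows.
rows = [(256, 63, 9.4), (1024, 62, 9.1)]

for batch, cpu, gpu in rows:
    # CPU time stays nearly flat, so the gain comes from GPU scaling with batch size.
    print(f"batch {batch}: {cpu / gpu:.1f}x faster on GPU")
# batch 256: 6.7x faster on GPU
# batch 1024: 6.8x faster on GPU
```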

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Takeaways: How to get the best of the GPU
You will experience big GPU improvements in the following cases:
@@ -254,14 +233,12 @@
2. Bigger batch sizes get the best of GPU, while CPU does not scale with bigger batch sizes;
3. Bigger dataset sizes get the best of GPU, while they may be a bottleneck when running on CPU and lead to performance drops;
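To act on these takeaways, Spark NLP must be launched with its GPU build in the first place. A minimal launch sketch, assuming the `spark-nlp-gpu` artifact for Scala 2.12 and the `3.3.4` version used in this benchmark (adjust both to your environment):

```shell
# Sketch (assumptions: Scala 2.12 build, Spark NLP 3.3.4 as benchmarked above).
# Pull the GPU-enabled Spark NLP package when starting the Spark shell.
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
```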

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### MultiGPU training
Right now, we don't support multi-GPU training (training a single model across several GPUs in parallel), but you can train different models on different GPUs.
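One common way to do exactly that — sketched here as an assumption about your setup, with `train_model_a.py` and `train_model_b.py` as hypothetical placeholder scripts — is to pin each independent training process to its own GPU via the standard `CUDA_VISIBLE_DEVICES` variable:

```shell
# Sketch: two independent trainings, each seeing only one GPU.
# The script names are placeholders for your own training jobs.
CUDA_VISIBLE_DEVICES=0 python train_model_a.py &
CUDA_VISIBLE_DEVICES=1 python train_model_b.py &
wait  # block until both background jobs have finished
```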

</div>
<div class="h3-box" markdown="1">
</div><div class="h3-box" markdown="1">

### Where to look for more information about Training
Please take a look at the [Spark NLP](https://sparknlp.org/docs/en/training) and [Spark NLP for Healthcare](https://sparknlp.org/docs/en/licensed_training) Training sections, and feel free to reach out to us if you want to maximize the performance on your GPU.
2 changes: 1 addition & 1 deletion docs/en/annotators.md
@@ -1,11 +1,11 @@
---
layout: docs
header: true
seotitle: Spark NLP - Annotators
title: Spark NLP - Annotators
permalink: /docs/en/annotators
key: docs-annotators
modify_date: "2021-04-17"
use_language_switcher: "Python-Scala"
show_nav: true
sidebar:
nav: sparknlp
3 changes: 1 addition & 2 deletions docs/en/auxiliary.md
@@ -1,12 +1,11 @@
---
layout: docs
header: true
seotitle: Spark NLP
seotitle: Spark NLP - Helper functions
title: Helper functions
permalink: /docs/en/auxiliary
key: docs-auxiliary
modify_date: "2019-11-28"
use_language_switcher: "Python-Scala"
show_nav: true
sidebar:
nav: sparknlp
10 changes: 3 additions & 7 deletions docs/en/quickstart.md
@@ -1,8 +1,8 @@
---
layout: docs
header: true
title: Spark NLP - Quick Start
seotitle: Spark NLP - Getting Started
title: Spark NLP - Quick Start
permalink: /docs/en/quickstart
key: docs-quickstart
modify_date: "2021-03-20"
@@ -82,10 +82,6 @@ It is full of fresh examples and even a docker container if you want to skip ins

If you need more detailed information about how to install Spark NLP, you can check the [Installation page](install)

Detailed information about Spark NLP concepts, annotators and more may
be found [HERE](annotators)

</div>

Detailed information about Spark NLP concepts, annotators and more may be found [HERE](annotators)

</div>
</div></div>
43 changes: 17 additions & 26 deletions docs/en/third-party-projects.md
@@ -6,31 +6,24 @@ title: Third Party Projects
permalink: /docs/en/third-party-projects
key: docs-third-party-projects
modify_date: "2021-10-25"
use_language_switcher: "Python-Scala"
show_nav: true
sidebar:
nav: sparknlp
---

There are third party projects that can integrate with Spark NLP. These
packages need to be installed separately to be used.
<div class="h3-box" markdown="1">

If you'd like to integrate your application with Spark NLP, please send us a
message!
There are third party projects that can integrate with Spark NLP. These packages need to be installed separately to be used.

If you'd like to integrate your application with Spark NLP, please send us a message!

## Logging

### Comet

[Comet](https://www.comet.ml/) is a meta machine learning platform designed
to help AI practitioners and teams build reliable machine learning models for
real-world applications by streamlining the machine learning model lifecycle. By
leveraging Comet, users can track, compare, explain and reproduce their machine
learning experiments.
[Comet](https://www.comet.ml/) is a meta machine learning platform designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By leveraging Comet, users can track, compare, explain and reproduce their machine learning experiments.

Comet can easily integrated into the Spark NLP workflow with the a dedicated
logging class `CometLogger` to log training and evaluation metrics,
pipeline parameters and NER visualization made with sparknlp-display.
Comet can be easily integrated into the Spark NLP workflow with a dedicated logging class `CometLogger` to log training and evaluation metrics, pipeline parameters and NER visualizations made with sparknlp-display.

For more information see the [User Guide](/api/python/third_party/Comet.html) and for more examples see the [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/logging/Comet_SparkNLP_Integration.ipynb).

@@ -96,19 +89,17 @@ logger.experiment.display(tab='charts')

</details>

</div><div class="h3-box" markdown="1">

### MLflow

Spark NLP uses Spark MLlib Pipelines, which are natively supported by MLflow.
MLFlow is, as stated in their [official webpage](https://mlflow.org/), an open
source platform for the machine learning lifecycle, that includes:
* **Mlflow Tracking**: Record and query experiments: code, data, config, and
results
* **MLflow Projects**: Package data science code in a format to reproduce runs
on any platform
* **MLflow Models**: Deploy machine learning models in diverse serving
environments
* **Model Registry**: Store, annotate, discover, and manage models in a central
repository

For more information, please see the complete guide at [Experiment
Tracking](/docs/en/mlflow).
MLflow is, as stated on its [official webpage](https://mlflow.org/), an open source platform for the machine learning lifecycle that includes:
* **MLflow Tracking**: Record and query experiments: code, data, config, and results
* **MLflow Projects**: Package data science code in a format to reproduce runs on any platform
* **MLflow Models**: Deploy machine learning models in diverse serving environments
* **Model Registry**: Store, annotate, discover, and manage models in a central repository

For more information, please see the complete guide at [Experiment Tracking](/docs/en/mlflow).

</div>
