Final edit on data preprocessing notebooks (#29940)
rszper authored Jan 6, 2024
1 parent f79eadd commit 3260a7b
Showing 3 changed files with 13 additions and 15 deletions.
Original file line number Diff line number Diff line change
@@ -63,9 +63,9 @@
{
"cell_type": "markdown",
"source": [
"[`ComputeAndApplyVocabulary`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning (ML) tasks.\n",
"The [`ComputeAndApplyVocabulary`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) data processing transform computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning (ML) tasks.\n",
"\n",
"When you train ML models that use text data, generating a vocabulary on the incoming dataset is a crucial preprocessing step. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of dataset. This step allows ML models to process the same words in a consistent way.\n",
"When you train ML models that use text data, generating a vocabulary on the incoming dataset is an important preprocessing step. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of the dataset. This step allows ML models to process the same words in a consistent way.\n",
"\n",
"This notebook shows how to use `MLTransform` to complete the following tasks:\n",
"* Use `write` mode to generate a vocabulary on the input text and assign an index value to each token.\n",
@@ -120,7 +120,7 @@
{
"cell_type": "markdown",
"source": [
"## Artifact location\n",
"## Use the artifact location\n",
"\n",
"In `write` mode, the artifact location is used to store artifacts, such as the vocabulary file generated by `ComputeAndApplyVocabulary`.\n",
"\n",
@@ -163,7 +163,7 @@
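The write/read artifact flow can be illustrated with a small, self-contained Python sketch. The `vocab.json` file name and both helper functions here are hypothetical; Beam persists `MLTransform` artifacts in its own internal format at the artifact location.

```python
import json
import os
import tempfile


def write_vocab_artifact(artifact_location, tokens):
    """'write' mode: compute a vocabulary and persist it as an artifact."""
    vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
    os.makedirs(artifact_location, exist_ok=True)
    # Hypothetical artifact file; Beam's real artifact layout is internal.
    with open(os.path.join(artifact_location, "vocab.json"), "w") as f:
        json.dump(vocab, f)
    return vocab


def read_vocab_artifact(artifact_location):
    """'read' mode: reuse the stored artifact instead of recomputing it."""
    with open(os.path.join(artifact_location, "vocab.json")) as f:
        return json.load(f)


location = tempfile.mkdtemp()
write_vocab_artifact(location, "i love going to the park".split())
vocab = read_vocab_artifact(location)
print(vocab["love"])  # 2
```

The key point mirrored from the notebook: the vocabulary is computed once in `write` mode, and `read` mode only loads it, so later datasets are mapped with the exact same token-to-index table.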
{
"cell_type": "markdown",
"source": [
"In this example, in `write` mode, `MLTransform` uses `ComputeAndApplyVocabulary` to generate vocabulary on the incoming dataset. The incoming text data is split into tokens and each token is assigned an unique index.\n",
"In this example, in `write` mode, `MLTransform` uses `ComputeAndApplyVocabulary` to generate a vocabulary on the incoming dataset. The incoming text data is split into tokens. Each token is assigned a unique index.\n",
"\n",
"The generated vocabulary is stored in an artifact location that you can use on a different dataset in `read` mode."
],
@@ -270,7 +270,7 @@
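The write-mode behavior described in that cell can be sketched in plain Python. This is an illustrative re-implementation of the idea, not Beam's actual `ComputeAndApplyVocabulary` code; in the notebook itself the transform is applied through `MLTransform` with a `write_artifact_location`.

```python
def compute_vocabulary(sentences):
    """Build a vocabulary: each distinct token gets a unique integer index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def apply_vocabulary(sentence, vocab):
    """Map each token in a sentence to its index in the vocabulary."""
    return [vocab[token] for token in sentence.split()]


sentences = ["a great movie", "a terrible movie"]
vocab = compute_vocabulary(sentences)
print(vocab)                                        # {'a': 0, 'great': 1, 'movie': 2, 'terrible': 3}
print(apply_vocabulary("a terrible movie", vocab))  # [0, 3, 2]
```

Note how repeated tokens ("a", "movie") resolve to the same index in every sentence, which is the consistency property the notebook highlights.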
{
"cell_type": "markdown",
"source": [
"## Frequency Threshold\n",
"## Set the frequency threshold\n",
"\n",
"The `frequency_threshold` parameter identifies the elements that appear frequently in the dataset. This parameter limits the generated vocabulary to elements with an absolute frequency greater than or equal to the specified threshold. If you don't specify the parameter, the entire vocabulary is generated.\n",
"\n",
@@ -317,7 +317,7 @@
{
"cell_type": "markdown",
"source": [
"In the output, if the frequency of the token is less than the specified frequency, it is assigned to a `default_value` of `-1`. For the other tokens, a vocabulary file is generated."
"In the output, if the frequency of the token is less than the specified frequency threshold, it's assigned the `default_value` of `-1`. For the other tokens, a vocabulary file is generated."
],
"metadata": {
"id": "h1s4a6hzxKrb"
@@ -357,7 +357,7 @@
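A minimal sketch of the `frequency_threshold` and `default_value` behavior described above, in plain Python for illustration (not Beam's implementation):

```python
from collections import Counter


def compute_vocab_with_threshold(sentences, frequency_threshold):
    """Keep only tokens whose absolute frequency meets the threshold."""
    counts = Counter(token for s in sentences for token in s.split())
    kept = [token for token, c in counts.items() if c >= frequency_threshold]
    return {token: index for index, token in enumerate(kept)}


def apply_vocab(sentence, vocab, default_value=-1):
    """Tokens that fell below the threshold map to default_value."""
    return [vocab.get(token, default_value) for token in sentence.split()]


sentences = ["good movie", "good plot", "bad acting"]
vocab = compute_vocab_with_threshold(sentences, frequency_threshold=2)
print(apply_vocab("good movie", vocab))  # [0, -1] : only 'good' appears twice
```

With no threshold (or a threshold of 1), every token would enter the vocabulary, matching the "entire vocabulary is generated" default.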
{
"cell_type": "markdown",
"source": [
"## `MLTransform` for inference workloads\n",
"## Use MLTransform for inference workloads\n",
"\n",
"When `MLTransform` is in `write` mode, it produces artifacts, such as vocabulary files for `ComputeAndApplyVocabulary`. These artifacts allow you to apply the same vocabulary, and any other preprocessing transforms, when you train your model and serve it in production, or when you test its accuracy.\n",
"\n",
@@ -78,11 +78,11 @@
"\n",
"For each data processing transform, `MLTransform` runs in both `write` mode and `read` mode. For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.\n",
"\n",
"### MLTransform in write mode\n",
"## MLTransform in write mode\n",
"\n",
"When `MLTransform` is in `write` mode, it produces artifacts, such as minimum, maximum, and variance, for different data processing transforms. These artifacts allow you to ensure that you're applying the same artifacts, and any other preprocessing transforms, when you train your model and serve it in production, or when you test its accuracy.\n",
"\n",
"### MLTransform in read mode\n",
"## MLTransform in read mode\n",
"\n",
"In read mode, `MLTransform` uses the artifacts generated in `write` mode to scale the entire dataset."
],
@@ -146,7 +146,7 @@
{
"cell_type": "code",
"source": [
"# data used in MLTransform's write mode.\n",
"# data used in MLTransform's write mode\n",
"data = [\n",
" {'int_feature_1' : 11, 'int_feature_2': -10},\n",
" {'int_feature_1': 34, 'int_feature_2': -33},\n",
@@ -156,7 +156,7 @@
" {'int_feature_1': 63, 'int_feature_2': -21},\n",
"]\n",
"\n",
"# data used in MLTransform's read mode.\n",
"# data used in MLTransform's read mode\n",
"test_data = [\n",
" {'int_feature_1': 29, 'int_feature_2': -20},\n",
" {'int_feature_1': -5, 'int_feature_2': -11},\n",
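The write/read split for scaling can be illustrated with min-max scaling over integer features like the ones shown above. This is a plain-Python sketch under the assumption of a min-max style transform (such as Beam's `ScaleTo01`); the helper names are hypothetical, and the "artifacts" here are just the computed minimum and maximum.

```python
def write_mode(data, key):
    """'write' mode: compute min/max artifacts from the training data."""
    values = [row[key] for row in data]
    return {"min": min(values), "max": max(values)}


def read_mode(data, key, artifacts):
    """'read' mode: scale new data with the stored artifacts,
    not with the new data's own min/max."""
    lo, hi = artifacts["min"], artifacts["max"]
    return [(row[key] - lo) / (hi - lo) for row in data]


train = [{'int_feature_1': 11}, {'int_feature_1': 34}, {'int_feature_1': 63}]
artifacts = write_mode(train, 'int_feature_1')        # {'min': 11, 'max': 63}
scaled = read_mode([{'int_feature_1': 29}], 'int_feature_1', artifacts)
print(scaled)  # (29 - 11) / (63 - 11), relative to the training range
```

Because `read` mode reuses the training-time min and max, a value outside the training range (such as `-5` in `test_data`) scales to a value outside `[0, 1]`, which is exactly why the artifacts must be shared between training and serving.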
@@ -72,7 +72,7 @@
"* **Machine translation:** Translate text from one language to another and preserve the meaning.\n",
"* **Text summarization:** Create shorter summaries of text.\n",
"\n",
"This notebook uses the Vertex AI text-embeddings API to generate text embeddings that use Google’s large generative artificial intelligence (AI) models. To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). \n",
"This notebook uses the Vertex AI text-embeddings API to generate text embeddings that use Google’s large generative artificial intelligence (AI) models. To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) in the Vertex AI documentation. \n",
"\n",
"For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.\n",
"\n",
@@ -156,9 +156,7 @@
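Embeddings help with tasks like these because they map text to vectors whose geometric closeness tracks semantic similarity, typically measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from the Vertex AI model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


# Toy "embeddings" (hypothetical values, not Vertex AI output).
emb = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.2, 0.9],
}
print(cosine_similarity(emb["dog"], emb["puppy"])
      > cosine_similarity(emb["dog"], emb["car"]))  # True
```

This geometric property is what makes embeddings useful for semantic search and classification: related texts land near each other in the vector space.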
"\n",
"### Use MLTransform in write mode\n",
"\n",
"In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy.\n",
"\n",
"For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation."
"In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy."
],
"metadata": {
"id": "cokOaX2kzyke"