diff --git a/introduction_to_amazon_algorithms/index.rst b/introduction_to_amazon_algorithms/index.rst index a31daba595..6dda8540b0 100644 --- a/introduction_to_amazon_algorithms/index.rst +++ b/introduction_to_amazon_algorithms/index.rst @@ -73,7 +73,7 @@ Text jumpstart_sentence_pair_classification/Amazon_JumpStart_Sentence_Pair_Classification jumpstart_text_classification/Amazon_JumpStart_Text_Classification jumpstart_text_generation/Amazon_JumpStart_Text_Generation - text_classification_huggingface/Amazon_JumpStart_HF_Text_Classification + jumpstart_text_classification/Amazon_JumpStart_HuggingFace_Text_Classification Vision ------------------------------------------- diff --git a/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_HuggingFace_Text_Classification.ipynb b/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_HuggingFace_Text_Classification.ipynb new file mode 100644 index 0000000000..9e4b887aa5 --- /dev/null +++ b/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_HuggingFace_Text_Classification.ipynb @@ -0,0 +1,1184 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e960490f", + "metadata": {}, + "source": [ + "# Introduction to SageMaker HuggingFace - Text Classification" + ] + }, + { + "cell_type": "markdown", + "id": "96f2327e", + "metadata": {}, + "source": [ + "---\n", + "Welcome to [Amazon SageMaker Built-in Algorithms](https://sagemaker.readthedocs.io/en/stable/algorithms/index.html)! You can use SageMaker Built-in algorithms to solve many Machine Learning tasks through the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html). You can also use these algorithms with one click in SageMaker Studio via [JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html).\n", + "\n", + "In this demo notebook, we demonstrate how to use the JumpStart API for Text Classification. Text Classification refers to classifying an input sentence into one of the class labels of the training dataset. We demonstrate the following text classification tasks here:\n", + "\n", + "* How to run inference on any [Text Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) model available on HuggingFace.\n", + "* How to fine-tune any pre-trained [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) or [Text Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) model available on HuggingFace to a custom dataset, and then run inference on the fine-tuned model.\n", + "* How to run batch inference.\n", + "\n", + "Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "8091e1f6", + "metadata": {}, + "source": [ + "1. [Set Up](#1.-Set-Up)\n", + "2. [Select a Text-Classification Model](#2.-Select-a-Text-Classification-Model)\n", + "3. 
[Run inference on the pre-trained model](#3.-Run-Inference-on-the-pre-trained-model)\n", + " * [Retrieve Artifacts & Deploy an Endpoint](#3.1.-Retrieve-Artifacts-&-Deploy-an-Endpoint)\n", + " * [Example input sentences for inference](#3.2.-Example-input-sentences-for-inference)\n", + " * [Query endpoint and parse response](#3.3.-Query-endpoint-and-parse-response)\n", + " * [Clean up the endpoint](#3.4.-Clean-up-the-endpoint)\n", + "4. [Fine-tune the pre-trained model on a custom dataset](#4.-Fine-Tune-the-pre-trained-model-on-a-custom-dataset)\n", + " * [Retrieve Training artifacts](#4.1.-Retrieve-Training-artifacts)\n", + " * [Set Training parameters](#4.2.-Set-Training-parameters)\n", + " * [Train with Automatic Model Tuning](#4.3.-Train-with-Automatic-Model-Tuning-([HPO]))\n", + " * [Start Training](#4.4.-Start-Training)\n", + " * [Extract Training performance metrics](#4.5.-Extract-Training-performance-metrics)\n", + " * [Deploy & run Inference on the fine-tuned model](#4.6.-Deploy-&-run-Inference-on-the-fine-tuned-model)\n", + " * [Incrementally train the fine-tuned model](#4.7.-Incrementally-train-the-fine-tuned-model)\n", + "5. [Run Batch Transform](#5.-Run-Batch-Transform)\n", + " * [Prepare data for Batch Transform](#5.1.-Prepare-data-for-Batch-Transform)\n", + " * [Deploy Model for Batch Transform Job](#5.2.-Deploy-Model-for-Batch-Transform-Job)\n", + " * [Compare Predictions With the Ground Truth](#5.3.-Compare-Predictions-With-the-Ground-Truth)" + ] + }, + { + "cell_type": "markdown", + "id": "2007b31a", + "metadata": {}, + "source": [ + "## 1. Set Up\n", + "***\n", + "Before executing the notebook, there are some initial steps required for setup.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39b943ff", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install ipywidgets==7.0.0 --quiet" + ] + }, + { + "cell_type": "markdown", + "id": "9b1051f6", + "metadata": {}, + "source": [ + "---\n", + "\n", + "To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3. \n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5a3eb07", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker, boto3, json\n", + "from sagemaker.session import Session\n", + "\n", + "sagemaker_session = Session()\n", + "aws_role = sagemaker_session.get_caller_identity_arn()\n", + "aws_region = boto3.Session().region_name\n", + "sess = sagemaker.Session()" + ] + }, + { + "cell_type": "markdown", + "id": "ee983c64", + "metadata": {}, + "source": [ + "## 2. Select a Text Classification Model\n", + "***\n", + "You can continue with the default model, or choose a different model from the dropdown generated upon running the next cell. 
A complete list of JumpStart fine-tuned models can also be accessed at [JumpStart Fine-Tuned Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#).\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "960ca9c8", + "metadata": {}, + "outputs": [], + "source": [ + "model_id = \"huggingface-tc-bert-base-cased\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2cd2df7", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "from ipywidgets import Dropdown\n", + "from sagemaker.jumpstart.notebook_utils import list_jumpstart_models\n", + "from sagemaker.jumpstart.filters import And\n", + "\n", + "# Retrieves all Text Classification models available by SageMaker Built-In Algorithms.\n", + "filter_value = And(\"task == tc\", \"framework == huggingface\")\n", + "tc_models = list_jumpstart_models(filter=filter_value)\n", + "# display the model-ids in a dropdown, for user to select a model.\n", + "dropdown = Dropdown(\n", + " value=model_id,\n", + " options=tc_models,\n", + " description=\"Sagemaker Pre-Trained Text Classification Models:\",\n", + " style={\"description_width\": \"initial\"},\n", + " layout={\"width\": \"max-content\"},\n", + ")\n", + "display(IPython.display.Markdown(\"## Select a pre-trained model from the dropdown below\"))\n", + "display(dropdown)" + ] + }, + { + "cell_type": "markdown", + "id": "763630b5", + "metadata": {}, + "source": [ + "### Using Models not Present in the Dropdown\n", + "***\n", + "If you want to choose any other model which is not present in the dropdown and is available at HugginFace [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) please choose huggingface-tc-models in the dropdown and pass the model_id in the HF_MODEL_ID variable. Inference on the models listed in the dropdown menu can be run in [network isolation](https://docs.aws.amazon.com/sagemaker/latest/dg/mkt-algo-model-internet-free.html) under VPC settings. However, when running inference on a model specified through HF_MODEL_ID, VPC settings with network isolation will not work.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c8a23c63", + "metadata": { + "collapsed": false, + "pycharm": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "# model_version=\"*\" fetches the latest version of the model.\n", + "infer_model_id, infer_model_version = dropdown.value, \"*\"\n", + "\n", + "hub = {}\n", + "HF_MODEL_ID = \"distilbert-base-uncased-finetuned-sst-2-english\" # Pass any other HF_MODEL_ID from - https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads\n", + "if infer_model_id == \"huggingface-tc-models\":\n", + " hub[\"HF_MODEL_ID\"] = HF_MODEL_ID\n", + " hub[\"HF_TASK\"] = \"text-classification\"" + ] + }, + { + "cell_type": "markdown", + "id": "c43e49a7", + "metadata": { + "collapsed": false + }, + "source": [ + "## 3. Run Inference on the pre-trained model\n", + "***\n", + "Using SageMaker, we can perform inference on the fine-tuned model. For this example, that means on an input sentence, predicting the class label from one of the 2 classes of the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset. 
Otherwise, it means predicting the class label with whichever model you chose from the HuggingFace [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) models.\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "c45f9242", + "metadata": { + "collapsed": false + }, + "source": [ + "### 3.1. Retrieve Artifacts & Deploy an Endpoint\n", + "***\n", + "We retrieve the deploy_image_uri, deploy_source_uri, and base_model_uri for the pre-trained model. To host the pre-trained model, we create an instance of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) and deploy it.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fecbe672", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker import image_uris, model_uris, script_uris\n", + "from sagemaker.model import Model\n", + "from sagemaker.predictor import Predictor\n", + "from sagemaker.utils import name_from_base\n", + "\n", + "endpoint_name = name_from_base(f\"jumpstart-example-{infer_model_id}\")\n", + "\n", + "inference_instance_type = \"ml.p2.xlarge\"\n", + "\n", + "# Retrieve the inference docker container uri.\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=infer_model_id,\n", + " model_version=infer_model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "\n", + "# Retrieve the inference script uri.\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=infer_model_id, model_version=infer_model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "# Retrieve the base model uri.\n", + "base_model_uri = model_uris.retrieve(\n", + " model_id=infer_model_id, model_version=infer_model_version, model_scope=\"inference\"\n", + ")\n", + "\n", + "# Create the SageMaker model instance. Note that we need to pass the Predictor class when we deploy the model through the Model class,\n", + "# so that we can run inference through the SageMaker API.\n", + "model = Model(\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " model_data=base_model_uri,\n", + " entry_point=\"inference.py\",\n", + " role=aws_role,\n", + " predictor_cls=Predictor,\n", + " name=endpoint_name,\n", + " env=hub,\n", + ")\n", + "\n", + "# Deploy the model.\n", + "base_model_predictor = model.deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " endpoint_name=endpoint_name,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "524fcba5", + "metadata": { + "collapsed": false + }, + "source": [ + "### 3.2. Example input sentences for inference\n", + "***\n", + "These examples are taken from the SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html). \n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78fdee16", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "text1 = \"astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment\"\n", + "text2 = \"simply stupid , irrelevant and deeply , truly , bottomlessly cynical \"" + ] + }, + { + "cell_type": "markdown", + "id": "35924c6e", + "metadata": { + "collapsed": false + }, + "source": [ + "### 3.3. 
Query endpoint and parse response\n", + "***\n", + "Input to the endpoint is a single sentence. Response from the endpoint is a dictionary containing the predicted class label, and a list of class label probabilities.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e27d2f2", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n", + "\n", + "\n", + "def query_endpoint(encoded_text):\n", + " response = base_model_predictor.predict(\n", + " encoded_text, {\"ContentType\": \"application/x-text\", \"Accept\": \"application/json;verbose\"}\n", + " )\n", + " return response\n", + "\n", + "\n", + "def parse_response(query_response):\n", + " model_predictions = json.loads(query_response)\n", + " probabilities, labels, predicted_label = (\n", + " model_predictions[\"probabilities\"],\n", + " model_predictions[\"labels\"],\n", + " model_predictions[\"predicted_label\"],\n", + " )\n", + " return probabilities, labels, predicted_label\n", + "\n", + "\n", + "for text in [text1, text2]:\n", + " query_response = query_endpoint(text.encode(\"utf-8\"))\n", + " probabilities, labels, predicted_label = parse_response(query_response)\n", + " print(\n", + " f\"Inference:{newline}\"\n", + " f\"Input text: '{text}'{newline}\"\n", + " f\"Model prediction: {probabilities}{newline}\"\n", + " f\"Labels: {labels}{newline}\"\n", + " f\"Predicted Label: {bold}{predicted_label}{unbold}{newline}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "a60dbad7", + "metadata": { + "collapsed": false + }, + "source": [ + "### 3.4. Clean up the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f8225e", + "metadata": { + "collapsed": false, + "pycharm": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "# Delete the SageMaker endpoint and the attached resources\n", + "base_model_predictor.delete_model()\n", + "base_model_predictor.delete_endpoint()" + ] + }, + { + "cell_type": "markdown", + "id": "70950bf9", + "metadata": { + "collapsed": false + }, + "source": [ + "## 4. Fine-Tune the pre-trained model on a custom dataset\n", + "***\n", + "### We support fine-tuning on any pre-trained model available on HugginFace [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) and [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads). Though only the models in the dropdown list can be fine-tuned in network isolation. 
Please select huggingface-tc-models in the dropdown above if you can't find your choice of model to fine-tune in the dropdown list, and specify id of any model available in HugginFace [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) and [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads), in the HF_MODEL_ID variable below.\n", + "\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74a9b640", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "HF_MODEL_ID = \"distilbert-base-uncased\" # Specify the HF_MODEL_ID here from https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads or https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads" + ] + }, + { + "cell_type": "markdown", + "id": "1d1f6f8c", + "metadata": { + "collapsed": false + }, + "source": [ + "***\n", + "Previously, we saw how to run inference on a fine-tuned model. Next, we discuss how a model can be finetuned to a custom dataset with any number of classes.\n", + "\n", + "The Text Embedding model can be fine-tuned on any text classification dataset in the same way the\n", + "model available for inference has been fine-tuned on the SST2 movie review dataset.\n", + "\n", + "The model available for fine-tuning attaches a classification layer to the Text Embedding model\n", + "and initializes the layer parameters to random values.\n", + "The output dimension of the classification layer is determined based on the number of classes\n", + "detected in the input data. The fine-tuning step fine-tunes all the model\n", + "parameters to minimize prediction error on the input data and returns the fine-tuned model.\n", + "The model returned by fine-tuning can be further deployed for inference.\n", + "Below are the instructions for how the training data should be formatted for input to the model.\n", + "\n", + "\n", + "- **Input:** A directory containing a 'data.csv' file.\n", + " - Each row of the first column of 'data.csv' should have integer class labels between 0 to the number of classes.\n", + " - Each row of the second column should have the corresponding text.\n", + "- **Output:** A trained model that can be deployed for inference.\n", + "\n", + "Below is an example of 'data.csv' file showing values in its first two columns. Note that the file should not have any header.\n", + "\n", + "| | |\n", + "|---|---|\n", + "|0\t|hide new secretions from the parental units|\n", + "|0\t|contains no wit , only labored gags|\n", + "|1\t|that loves its characters and communicates something rather beautiful about human nature|\n", + "|...|...|\n", + "\n", + "SST2 dataset is downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2).\n", + " [Apache 2.0 License](https://jumpstart-cache-prod-us-west-2.s3-us-west-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt).\n", + " [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html).\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "f4552dd0", + "metadata": { + "collapsed": false + }, + "source": [ + "### 4.1. Retrieve Training artifacts\n", + "***\n", + "Here, for the selected model, we retrieve the training docker container, the training algorithm source, the pre-trained model, and a python dictionary of the training hyper-parameters that the algorithm accepts with their default values. Note that the model_version=\"*\" fetches the latest model. 
Also, we need to specify the training_instance_type to fetch the train_image_uri.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3d9fd0c", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n", + "\n", + "model_id, model_version = dropdown.value, \"*\"\n", + "training_instance_type = \"ml.p3.2xlarge\"\n", + "\n", + "# Retrieve the docker image\n", + "train_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " model_id=model_id,\n", + " model_version=model_version,\n", + " image_scope=\"training\",\n", + " instance_type=training_instance_type,\n", + ")\n", + "# Retrieve the training script\n", + "train_source_uri = script_uris.retrieve(\n", + " model_id=model_id, model_version=model_version, script_scope=\"training\"\n", + ")\n", + "# Retrieve the pre-trained model tarball to further fine-tune\n", + "if model_id != \"huggingface-tc-models\":\n", + " train_model_uri = model_uris.retrieve(\n", + " model_id=model_id, model_version=model_version, model_scope=\"training\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "c0bfe06b", + "metadata": { + "collapsed": false + }, + "source": [ + "### 4.2. Set Training parameters\n", + "***\n", + "Now that we are done with all the setup that is needed, we are ready to fine-tune our Text Classification model. To begin, let us create a [``sagemaker.estimator.Estimator``](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job. \n", + "\n", + "There are two kinds of parameters that need to be set for training. \n", + "\n", + "The first set consists of the parameters for the training job. These include: (i) Training data path: this is the S3 folder in which the input data is stored, (ii) Output path: this is the S3 folder in which the training output is stored, (iii) Training instance type: this indicates the type of machine on which to run the training. Typically, we use GPU instances for this training. We defined the training instance type above to fetch the correct train_image_uri. \n", + "***\n", + "The second set of parameters consists of algorithm-specific training hyper-parameters. They are also used to specify the model name if we want to fine-tune a model that is not present in the dropdown list.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "036bac37", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Sample training data is available in this bucket\n", + "training_data_bucket = f\"jumpstart-cache-prod-{aws_region}\"\n", + "training_data_prefix = \"training-datasets/SST/\"\n", + "\n", + "training_dataset_s3_path = f\"s3://{training_data_bucket}/{training_data_prefix}\"\n", + "\n", + "output_bucket = sess.default_bucket()\n", + "output_prefix = \"jumpstart-example-tc-training\"\n", + "\n", + "s3_output_location = f\"s3://{output_bucket}/{output_prefix}/output\"" + ] + }, + { + "cell_type": "markdown", + "id": "8ad02cf3", + "metadata": { + "collapsed": false + }, + "source": [ + "***\n", + "For algorithm-specific hyper-parameters, we start by fetching a Python dictionary of the training hyper-parameters that the algorithm accepts with their default values. 
These can then be overridden with custom values.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "651f68c9", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker import hyperparameters\n", + "\n", + "# Retrieve the default hyper-parameters for fine-tuning the model\n", + "hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)\n", + "\n", + "# [Optional] Override default hyperparameters with custom values\n", + "hyperparameters[\"batch_size\"] = \"64\"" + ] + }, + { + "cell_type": "markdown", + "id": "4e646e65", + "metadata": { + "collapsed": false + }, + "source": [ + "***\n", + "We will use the HF_MODEL_ID passed earlier so that we can fine-tune any of the HuggingFace [Fill-Mask](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads) and [Text-Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) models.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3e10f0a4", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "if model_id == \"huggingface-tc-models\":\n", + " hyperparameters[\"hub_key\"] = HF_MODEL_ID\n", + "\n", + "print(hyperparameters)" + ] + }, + { + "cell_type": "markdown", + "id": "a5051d41", + "metadata": { + "collapsed": false + }, + "source": [ + "### 4.3. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) \n", + "***\n", + "Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with the Amazon SageMaker hyperparameter tuning APIs.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c271247", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker.tuner import ContinuousParameter\n", + "\n", + "# Use AMT for tuning and selecting the best model\n", + "use_amt = False\n", + "\n", + "# Define the objective metric based on which the best model will be selected.\n", + "amt_metric_definitions = {\n", + " \"metrics\": [{\"Name\": \"val_accuracy\", \"Regex\": \"'eval_accuracy': ([0-9\\\\.]+)\"}],\n", + " \"type\": \"Maximize\",\n", + "}\n", + "\n", + "# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model. (https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)\n", + "hyperparameter_ranges = {\n", + " \"learning_rate\": ContinuousParameter(0.00001, 0.0001, scaling_type=\"Logarithmic\")\n", + "}\n", + "\n", + "# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).\n", + "max_jobs = 6\n", + "# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.\n", + "# If max_jobs=max_parallel_jobs, then Bayesian search turns into random search.\n", + "max_parallel_jobs = 2" + ] + }, + { + "cell_type": "markdown", + "id": "0d9d2622", + "metadata": { + "collapsed": false + }, + "source": [ + "### 4.4. 
Start Training\n", + "***\n", + "We start by creating the estimator object with all the required assets and then launch the training job.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "973d923c", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker.estimator import Estimator\n", + "from sagemaker.utils import name_from_base\n", + "from sagemaker.tuner import HyperparameterTuner\n", + "\n", + "training_job_name = name_from_base(f\"jumpstart-example-{model_id}-transfer-learning\")\n", + "\n", + "training_metric_definitions = [\n", + " {\"Name\": \"val_accuracy\", \"Regex\": \"'eval_accuracy': ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"val_loss\", \"Regex\": \"'eval_loss': ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"train_loss\", \"Regex\": \"'loss': ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"val_f1\", \"Regex\": \"'eval_f1': ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"epoch\", \"Regex\": \"'epoch': ([0-9\\\\.]+)\"},\n", + "]\n", + "\n", + "\n", + "# Create SageMaker Estimator instance\n", + "tc_estimator = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " source_dir=train_source_uri,\n", + " model_uri=train_model_uri if model_id != \"huggingface-tc-models\" else None,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=s3_output_location,\n", + " base_job_name=training_job_name,\n", + " metric_definitions=training_metric_definitions,\n", + ")\n", + "\n", + "if use_amt:\n", + " hp_tuner = HyperparameterTuner(\n", + " tc_estimator,\n", + " amt_metric_definitions[\"metrics\"][0][\"Name\"],\n", + " hyperparameter_ranges,\n", + " amt_metric_definitions[\"metrics\"],\n", + " max_jobs=max_jobs,\n", + " max_parallel_jobs=max_parallel_jobs,\n", + " objective_type=amt_metric_definitions[\"type\"],\n", + " base_tuning_job_name=training_job_name,\n", + " )\n", + "\n", + " # Launch a SageMaker Tuning job to search for the best hyperparameters\n", + " hp_tuner.fit({\"training\": training_dataset_s3_path})\n", + "else:\n", + " # Launch a SageMaker Training job by passing s3 path of the training data\n", + " tc_estimator.fit({\"training\": training_dataset_s3_path}, logs=True)" + ] + }, + { + "cell_type": "markdown", + "id": "97ed581b", + "metadata": { + "collapsed": false + }, + "source": [ + "### 4.5. Extract Training performance metrics\n", + "***\n", + "Performance metrics such as training loss and validation accuracy/loss can be accessed through cloudwatch while the training. We can also fetch these metrics and analyze them within the notebook\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce268cd7", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sagemaker import TrainingJobAnalytics\n", + "\n", + "if use_amt:\n", + " training_job_name = hp_tuner.best_training_job()\n", + "else:\n", + " training_job_name = tc_estimator.latest_training_job.job_name\n", + "\n", + "\n", + "df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()\n", + "df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "bd2d20f9", + "metadata": { + "collapsed": false + }, + "source": [ + "## 4.6. Deploy & run Inference on the fine-tuned model\n", + "***\n", + "A trained model does nothing on its own. We now want to use the model to perform inference. 
For this example, that means predicting the class label of an input sentence. We follow the same steps as in [3. Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model). We start by retrieving the artifacts for deploying an endpoint. However, instead of base_predictor, we deploy the `tc_estimator` that we fine-tuned.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce738168", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "inference_instance_type = \"ml.p2.xlarge\"\n", + "\n", + "# Retrieve the inference docker container uri\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=model_id,\n", + " model_version=model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "# Retrieve the inference script uri\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "endpoint_name = name_from_base(f\"jumpstart-example-FT-{model_id}-\")\n", + "\n", + "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", + "finetuned_predictor = (hp_tuner if use_amt else tc_estimator).deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " entry_point=\"inference.py\",\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " endpoint_name=endpoint_name,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "017b6e0d", + "metadata": { + "collapsed": false + }, + "source": [ + "---\n", + "Next, we input example sentences for running inference.\n", + "These examples are taken from SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html). \n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3f4f611", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "text1 = \"astonishing ... 
( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment\"\n", + "text2 = \"simply stupid , irrelevant and deeply , truly , bottomlessly cynical \"" + ] + }, + { + "cell_type": "markdown", + "id": "ea22eef2", + "metadata": { + "collapsed": false + }, + "source": [ + "---\n", + "Next, we query the finetuned model, parse the response and print the predictions.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "097903dd", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n", + "\n", + "\n", + "def query_endpoint(encoded_text):\n", + " response = finetuned_predictor.predict(\n", + " encoded_text, {\"ContentType\": \"application/x-text\", \"Accept\": \"application/json;verbose\"}\n", + " )\n", + " return response\n", + "\n", + "\n", + "def parse_response(query_response):\n", + " model_predictions = json.loads(query_response)\n", + " probabilities, labels, predicted_label = (\n", + " model_predictions[\"probabilities\"],\n", + " model_predictions[\"labels\"],\n", + " model_predictions[\"predicted_label\"],\n", + " )\n", + " return probabilities, labels, predicted_label\n", + "\n", + "\n", + "for text in [text1, text2]:\n", + " query_response = query_endpoint(text.encode(\"utf-8\"))\n", + " probabilities, labels, predicted_label = parse_response(query_response)\n", + " print(\n", + " f\"Inference:{newline}\"\n", + " f\"Input text: '{text}'{newline}\"\n", + " f\"Model prediction: {probabilities}{newline}\"\n", + " f\"Labels: {labels}{newline}\"\n", + " f\"Predicted Label: {bold}{predicted_label}{unbold}{newline}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "5a3d9168", + "metadata": { + "collapsed": false + }, + "source": [ + "---\n", + "Next, we clean up the deployed endpoint.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49f98c21", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Delete the SageMaker endpoint and the attached resources\n", + "finetuned_predictor.delete_model()\n", + "finetuned_predictor.delete_endpoint()" + ] + }, + { + "cell_type": "markdown", + "id": "587b96ca", + "metadata": { + "collapsed": false + }, + "source": [ + "## 4.7. Incrementally train the fine-tuned model\n", + "***\n", + "Incremental training allows you to train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance. You can use the artifacts from an existing model and use an expanded dataset to train a new model. Incremental training saves both time and resources as you don’t need to retrain a model from scratch.\n", + "\n", + "One may use any dataset (old or new) as long as the dataset format remain the same (set of classes). 
Incremental training step is similar to the finetuning step discussed above with the following difference: In fine-tuning above, we start with a pre-trained model whereas in incremental training, we start with an existing fine-tuned model.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db4745a9", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# We will only do the incremental training for the fine-tuned models\n", + "if model_id == \"huggingface-tc-models\":\n", + " del hyperparameters[\"hub_key\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d7f0b1b", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Identify the previously trained model path based on the output location where artifacts are stored previously and the training job name.\n", + "\n", + "if use_amt: # If using amt, select the model for the best training job.\n", + " sage_client = boto3.Session().client(\"sagemaker\")\n", + " tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(\n", + " HyperParameterTuningJobName=hp_tuner._current_job_name\n", + " )\n", + " last_training_job_name = tuning_job_result[\"BestTrainingJob\"][\"TrainingJobName\"]\n", + "else:\n", + " last_training_job_name = tc_estimator._current_job_name\n", + "\n", + "last_trained_model_path = f\"{s3_output_location}/{last_training_job_name}/output/model.tar.gz\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f52446cc", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "incremental_train_output_prefix = \"jumpstart-example-ic-incremental-training\"\n", + "\n", + "incremental_s3_output_location = f\"s3://{output_bucket}/{incremental_train_output_prefix}/output\"\n", + "\n", + "incremental_training_job_name = name_from_base(f\"jumpstart-example-{model_id}-incremental-training\")\n", + "\n", + "\n", + "incremental_train_estimator = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " source_dir=train_source_uri,\n", + " model_uri=last_trained_model_path,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=incremental_s3_output_location,\n", + " base_job_name=incremental_training_job_name,\n", + " metric_definitions=training_metric_definitions,\n", + ")\n", + "\n", + "incremental_train_estimator.fit({\"training\": training_dataset_s3_path}, logs=True)" + ] + }, + { + "cell_type": "markdown", + "id": "029804c2", + "metadata": { + "collapsed": false + }, + "source": [ + "Once trained, we can use the same steps as in 4.6. Deploy & run Inference on the fine-tuned model to deploy the model." + ] + }, + { + "cell_type": "markdown", + "id": "3a75f30b", + "metadata": { + "collapsed": false + }, + "source": [ + "### 5. Run Batch Transform\n", + "***\n", + "Using SageMaker, we can perform batch inference on the fine-tuned model for large datasets. For this example, that means on an input sentence, predicting the class label from one of the 2 classes of the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset. 
\n", + "- Batch Inference is useful in the following scenarios:\n", + " - Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.\n", + " - Get inferences from large datasets.\n", + " - Run inference when you don't need a persistent endpoint.\n", + " - Associate input records with inferences to assist the interpretation of results.\n", + "\n", + "Below is an example of 'test.csv' file showing input sentences. Note that the file should not have any header.\n", + "\n", + "| |\n", + "|---|\n", + "|hide new secretions from the parental units|\n", + "|contains no wit , only labored gags|\n", + "|that loves its characters and communicates something rather beautiful about human nature|\n", + "|...|\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "be4ff6d7", + "metadata": { + "collapsed": false + }, + "source": [ + "### 5.1. Prepare data for Batch Transform\n", + "***\n", + "We will use the tiny [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset for running the batch inference. We will download the data locally, remove the labels and upload it to S3 for batch inference.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac0ab360", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import os\n", + "import ast\n", + "\n", + "s3 = boto3.client(\"s3\")\n", + "training_data_tiny_prefix = \"training-datasets/SST-tiny/\"\n", + "s3.download_file(training_data_bucket, training_data_tiny_prefix + \"data.csv\", \"data.csv\")\n", + "train_data = pd.read_csv(\"./data.csv\", header=None, names=[\"label\", \"sentence\"])\n", + "train_data.head(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4c410f6", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "test_data = train_data[[\"sentence\"]]\n", + "test_data.to_csv(\"test_data.csv\", header=False, index=False)\n", + "input_path = f\"s3://{output_bucket}/{output_prefix}/test/\"\n", + "output_path = f\"s3://{output_bucket}/{output_prefix}/batch_output/\"\n", + "s3.upload_file(\"test_data.csv\", output_bucket, os.path.join(output_prefix + \"/test/data.csv\"))" + ] + }, + { + "cell_type": "markdown", + "id": "8dade622", + "metadata": { + "collapsed": false + }, + "source": [ + "### 5.2. Deploy Model for Batch Transform Job\n", + "***\n", + "We will use the deploy_image_uri, deploy_source_uri, and base_model_uri of the pre-trained model defined in section 3 for deploying the model for Batch Inference. To host the pre-trained model, we create an instance of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) and deploy it.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed1a2335", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create the SageMaker model instance. 
Note that we need to pass Predictor class when we deploy model through Model class,\n", + "# for being able to run inference through the sagemaker API.\n", + "model = Model(\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " model_data=base_model_uri,\n", + " entry_point=\"inference.py\",\n", + " role=aws_role,\n", + " predictor_cls=Predictor,\n", + ")\n", + "\n", + "# Creating the Batch transformer object\n", + "batch_transformer = model.transformer(\n", + " instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " output_path=output_path,\n", + " assemble_with=\"Line\",\n", + " accept=\"text/csv;verbose\",\n", + " max_payload=1,\n", + ")\n", + "\n", + "# Making the predications on the input data\n", + "batch_transformer.transform(input_path, content_type=\"text/csv\", split_type=\"Line\")\n", + "\n", + "batch_transformer.wait()" + ] + }, + { + "cell_type": "markdown", + "id": "cc0ce94e", + "metadata": {}, + "source": [ + "### 5.3. Compare Predictions With the Ground Truth\n", + "***\n", + "We will compare the predictions on tiny [SST2](https://nlp.stanford.edu/sentiment/index.html) data with the actual labels.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f68e9b4", + "metadata": {}, + "outputs": [], + "source": [ + "s3.download_file(\n", + " output_bucket, output_prefix + \"/batch_output/\" + \"data.csv.out\", \"predict.csv.out\"\n", + ")\n", + "import ast\n", + "\n", + "with open(\"predict.csv.out\", \"r\") as predict_file:\n", + " predict_all = [ast.literal_eval(line.rstrip()) for line in predict_file]\n", + "\n", + "data_size = len(test_data)\n", + "df_predict = pd.DataFrame(predict_all)\n", + "df_predict[\"predicted_label\"] = df_predict[\"predicted_label\"].str[-1].astype(int)\n", + "accuracy = (\n", + " sum(\n", + " train_data.loc[: data_size - 1, \"label\"]\n", + " == df_predict.loc[: data_size - 1, \"predicted_label\"]\n", + " )\n", + " / data_size\n", + ")\n", + "\n", + "print(\"The accuracy of the model on the SST2 tiny data is: \", accuracy)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_python3", + "language": "python", + "name": "conda_python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/introduction_to_amazon_algorithms/jumpstart_text_classification/README.md b/introduction_to_amazon_algorithms/jumpstart_text_classification/README.md index dc1bf07d41..a0d799c39d 100644 --- a/introduction_to_amazon_algorithms/jumpstart_text_classification/README.md +++ b/introduction_to_amazon_algorithms/jumpstart_text_classification/README.md @@ -1,2 +1,4 @@ ### SageMaker JumpStart Text Classification Training & Deployment -This notebook `Amazon_JumpStart_Text_Classification.ipynb` for Text Classification demonstrates (a) how to use a pre-trained model available in JumpStart for inference, and (b) how to fine-tune a pre-trained Transformer model on a custom dataset using JumpStart transfer learning algorithm, using AMT (Automatic Model Tuning) to search for the best hyperparameters , and how to use the fine-tuned model for inference. \ No newline at end of file +We have two notebooks for text classification. Their description is as follows - \ +(1). 
The notebook `Amazon_JumpStart_Text_Classification.ipynb` uses TensorFlow models for Text Classification and demonstrates (a) how to use a pre-trained model available in JumpStart for inference, and (b) how to fine-tune a pre-trained Transformer model on a custom dataset using the JumpStart transfer learning algorithm, with AMT (Automatic Model Tuning) to search for the best hyperparameters, and how to use the fine-tuned model for inference.\ +(2). This notebook `Amazon_JumpStart_HuggingFace_Text_Classification.ipynb` uses HuggingFace models for Text Classification and demonstrates (a) how to use a pre-trained model available in JumpStart for inference, (b) how to fine-tune a pre-trained Transformer model on a custom dataset using the JumpStart transfer learning algorithm, with AMT (Automatic Model Tuning) to search for the best hyperparameters, and how to use the fine-tuned model for inference, and (c) how to run batch inference.