From e846e4c1aab8829981eb1924378b6127ac304089 Mon Sep 17 00:00:00 2001
From: jy
Date: Mon, 4 Mar 2024 00:01:24 +0900
Subject: [PATCH] update: lab02

---
 docs/index.html | 2 +-
 docs/posts/labs/lab02.html | 42 +++++++++++++++++++++-----------------
 docs/search.json | 8 ++++----
 posts/labs/lab02.ipynb | 42 +++++++++++++-------------------------
 4 files changed, 42 insertions(+), 52 deletions(-)

diff --git a/docs/index.html b/docs/index.html
index f9eba3f..9a523e4 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -205,7 +205,7 @@
Categories
-
+
@@ -294,7 +293,7 @@

👩‍💻 Lab 2

Lab 2: Quantization

Goals

-

In this assignment, you will practice quantizing a classic neural network model to reduce model size and latency. The goals of this assignment are as follows:

+

In this lab, you will practice quantizing a classic neural network model to reduce model size and latency. The goals of this lab are as follows:

  • Understand the basic concepts of quantization.
  • Implement and apply k-means quantization.
  • …
@@ -308,7 +307,12 @@

    Goals

    Contents

    The main content is organized into two sections: K-Means Quantization and Linear Quantization.

    -

    You will work through a total of 10 questions in this lab notebook: - There are 3 questions on K-Means Quantization (Questions 1-3). - There are 6 questions on Linear Quantization (Questions 4-9). - Question 10 compares k-means quantization and linear quantization.

    +

    You will work through a total of 10 questions in this lab notebook:

    +
      +
    • There are 3 questions on K-Means Quantization (Questions 1-3).
    • +
    • There are 6 questions on Linear Quantization (Questions 4-9).
    • +
    • Question 10 compares k-means quantization and linear quantization.
    • +

    The setup portion (Setup) of the lab notebook can be viewed by opening the Colaboratory Note. It is omitted from this post so that the focus stays on the lab content itself.

    @@ -346,7 +350,11 @@

    K-Means Quantization

    quantized_weight = codebook.centroids[codebook.labels].view_as(weight)

    \(n\)-bit k-means quantization partitions the synapses into \(2^n\) clusters, and synapses within the same cluster share the same weight value.

    -

    Therefore, k-means quantization generates the following codebook: * centroids: \(2^n\) fp32 cluster centroids. * labels: an \(n\)-bit integer tensor with the same #elements as the original fp32 weight tensor. Each integer indicates which cluster the weight belongs to.

    +

    Therefore, k-means quantization generates the following codebook:

    +
      +
    • centroids: \(2^n\) fp32 cluster centroids.
    • +
    • labels: an \(n\)-bit integer tensor with the same #elements as the original fp32 weight tensor. Each integer indicates which cluster the corresponding weight belongs to.
    • +

    During inference, an fp32 tensor is reconstructed from the codebook and used for the computation:

    quantized_weight = codebook.centroids[codebook.labels].view_as(weight)
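    To make the codebook structure concrete, below is a minimal sketch of building such a codebook with a plain k-means loop and then reconstructing the fp32 weights exactly as in the line above. This is an illustrative assumption rather than the lab's reference implementation: the Codebook namedtuple, the kmeans_quantize helper, and the linear centroid initialization are all hypothetical, and the lab itself may use a dedicated k-means library instead.

    import torch
    from collections import namedtuple

    # hypothetical container mirroring the codebook.centroids / codebook.labels access used above
    Codebook = namedtuple('Codebook', ['centroids', 'labels'])

    def kmeans_quantize(weight: torch.Tensor, bitwidth: int = 4, n_iter: int = 20) -> Codebook:
        # n-bit quantization -> 2^n shared weight values
        n_clusters = 2 ** bitwidth
        flat = weight.detach().flatten()
        # simple linear initialization over the weight range (the lab may use k-means++ or similar)
        centroids = torch.linspace(flat.min().item(), flat.max().item(), n_clusters)
        for _ in range(n_iter):
            # assign every weight to its nearest centroid
            labels = torch.argmin((flat[:, None] - centroids[None, :]).abs(), dim=1)
            # move each centroid to the mean of its assigned weights
            for k in range(n_clusters):
                members = flat[labels == k]
                if members.numel() > 0:
                    centroids[k] = members.mean()
        # labels stay int64 here for simplicity; the lab stores them as n-bit integers to save memory
        return Codebook(centroids=centroids, labels=labels.view_as(weight))

    # usage sketch: codebook = kmeans_quantize(conv.weight.data, bitwidth=4)
    #               quantized_weight = codebook.centroids[codebook.labels].view_as(weight)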

    @@ -1009,9 +1017,7 @@

    Quantized Inference

    Since \(Z_{\mathrm{weight}}=0\), we have \(r_{\mathrm{weight}} = S_{\mathrm{weight}}q_{\mathrm{weight}}\).

    The floating-point convolution can be written as follows.

    -

    \(r_{\mathrm{output}} = \mathrm{CONV}[r_{\mathrm{input}}, r_{\mathrm{weight}}] + r_{\mathrm{bias}}\\ -\;\;\;\;\;\;\;\;= \mathrm{CONV}[S_{\mathrm{input}}(q_{\mathrm{input}}-Z_{\mathrm{input}}), S_{\mathrm{weight}}q_{\mathrm{weight}}] + S_{\mathrm{bias}}(q_{\mathrm{bias}}-Z_{\mathrm{bias}})\\ -\;\;\;\;\;\;\;\;= \mathrm{CONV}[q_{\mathrm{input}}-Z_{\mathrm{input}}, q_{\mathrm{weight}}]\cdot (S_{\mathrm{input}} \cdot S_{\mathrm{weight}}) + S_{\mathrm{bias}}(q_{\mathrm{bias}}-Z_{\mathrm{bias}})\)

    +

    \(r_{\mathrm{output}} = \mathrm{CONV}[r_{\mathrm{input}}, r_{\mathrm{weight}}] + r_{\mathrm{bias}}\)\ \(\;\;\;\;\;\;\;\;= \mathrm{CONV}[S_{\mathrm{input}}(q_{\mathrm{input}}-Z_{\mathrm{input}}), S_{\mathrm{weight}}q_{\mathrm{weight}}] + S_{\mathrm{bias}}(q_{\mathrm{bias}}-Z_{\mathrm{bias}})\)\ \(\;\;\;\;\;\;\;\;= \mathrm{CONV}[q_{\mathrm{input}}-Z_{\mathrm{input}}, q_{\mathrm{weight}}]\cdot (S_{\mathrm{input}} \cdot S_{\mathrm{weight}}) + S_{\mathrm{bias}}(q_{\mathrm{bias}}-Z_{\mathrm{bias}})\)
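    The last equality above holds because convolution is linear in both of its arguments, so the constant scales \(S_{\mathrm{input}}\) and \(S_{\mathrm{weight}}\) factor out of the convolution. A small hedged numeric check of that step (the shapes, scales, and zero point below are made up purely for illustration):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    S_input, Z_input = 0.02, 3          # made-up input scale and zero point
    S_weight = 0.01                     # made-up weight scale; Z_weight = 0 (symmetric weights)
    q_input  = torch.randint(-128, 128, (1, 2, 5, 5)).float()
    q_weight = torch.randint(-128, 128, (4, 2, 3, 3)).float()

    # CONV[S_in*(q_in - Z_in), S_w*q_w]  vs  CONV[q_in - Z_in, q_w] * (S_in * S_w)
    lhs = F.conv2d(S_input * (q_input - Z_input), S_weight * q_weight)
    rhs = F.conv2d(q_input - Z_input, q_weight) * (S_input * S_weight)
    print(torch.allclose(lhs, rhs, atol=1e-5))  # True, up to floating-point rounding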

    To simplify the computation, we set

    @@ -1020,7 +1026,7 @@

    Quantized Inference

    so that

    -

    \(r_{\mathrm{output}} = (\mathrm{CONV}[q_{\mathrm{input}}-Z_{\mathrm{input}}, q_{\mathrm{weight}}] + q_{\mathrm{bias}})\cdot (S_{\mathrm{input}} \cdot S_{\mathrm{weight}})\) \(\;\;\;\;\;\;\;\;= (\mathrm{CONV}[q_{\mathrm{input}}, q_{\mathrm{weight}}] - \mathrm{CONV}[Z_{\mathrm{input}}, q_{\mathrm{weight}}] + q_{\mathrm{bias}})\cdot (S_{\mathrm{input}}S_{\mathrm{weight}})\)

    +

    \(r_{\mathrm{output}} = (\mathrm{CONV}[q_{\mathrm{input}}-Z_{\mathrm{input}}, q_{\mathrm{weight}}] + q_{\mathrm{bias}})\cdot (S_{\mathrm{input}} \cdot S_{\mathrm{weight}})\) \ \(\;\;\;\;\;\;\;\;\;= (\mathrm{CONV}[q_{\mathrm{input}}, q_{\mathrm{weight}}] - \mathrm{CONV}[Z_{\mathrm{input}}, q_{\mathrm{weight}}] + q_{\mathrm{bias}})\cdot (S_{\mathrm{input}}S_{\mathrm{weight}})\)

    and,

    @@ -1525,12 +1531,14 @@

    Question 9.1 (5 pts) # hint: you need to convert the original fp32 input of range (0, 1) # into int8 format of range (-128, 127) ############### YOUR CODE STARTS HERE ############### - return x.clamp(-128, 127).to(torch.int8) - ############### YOUR CODE ENDS HERE ################# - -int8_model_accuracy = evaluate(quantized_model, dataloader['test'], - extra_preprocess=[extra_preprocess]) -print(f"int8 model has accuracy={int8_model_accuracy:.2f}%")

+ x_scaled = x * 255 + x_shifted = x_scaled - 128 + return x_shifted.clamp(-128, 127).to(torch.int8) + ############### YOUR CODE ENDS HERE ################# + +int8_model_accuracy = evaluate(quantized_model, dataloader['test'], + extra_preprocess=[extra_preprocess]) +print(f"int8 model has accuracy={int8_model_accuracy:.2f}%")
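For intuition on the corrected preprocessing above: if the fp32 input in (0, 1) is treated as linearly quantized with an assumed scale of \(1/255\) and zero point \(-128\), then \(q = x \cdot 255 - 128\), so 0.0 maps to −128 and 1.0 maps to 127, which is exactly what the completed code computes. A tiny hedged check:

import torch

x = torch.tensor([0.0, 0.5, 1.0])
q = (x * 255 - 128).clamp(-128, 127).to(torch.int8)
print(q)  # tensor([-128, 0, 127], dtype=torch.int8); note that .to(torch.int8) truncates toward zero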
VGG(
   (backbone): Sequential(
@@ -1608,13 +1616,9 @@ 

Summary:

  • Linear quantization offers a balance of simplicity, speed, and broad hardware compatibility; it may not always reach the same accuracy on complex or non-uniform data distributions, but it is well suited to real-time processing and to devices with limited compute.
  • The choice between k-means-based quantization and linear quantization should follow the specific requirements of the application, weighing the importance of accuracy, inference latency, and the available computational resources.

    - - -
    -

    Feedback

    -

    Please fill out this feedback form when you have finished this lab. We would love to hear your thoughts on how we can improve it!

    +
    diff --git a/docs/search.json b/docs/search.json index 07b027f..dba8b37 100644 --- a/docs/search.json +++ b/docs/search.json @@ -81,14 +81,14 @@ "href": "posts/labs/lab02.html#goals", "title": "👩‍💻 Lab 2", "section": "Goals", - "text": "Goals\n이 과제에서는 모델 크기와 지연 시간을 줄이기 위해 클래식한 neural network model을 quantizing하는 연습을 할 것입니다. 이 과제의 목표는 다음과 같습니다:\n\nQuantization의 기본 개념을 이해합니다.\nk-means quantization을 구현하고 적용합니다.\nk-means quantization에 대해 quantization-aware training을 구현하고 적용합니다.\nlinear quantization을 구현하고 적용합니다.\nlinear quantization에 대해 integer-only inference를 구현하고 적용합니다.\nQuantization에서의 성능 개선(예: 속도 향상)에 대한 기본적인 이해를 얻습니다.\n이러한 quantization 접근 방식 사이의 차이점과 트레이드오프를 이해합니다." + "text": "Goals\n이번 실습에서는 모델 크기와 지연 시간을 줄이기 위해 클래식한 neural network model을 quantizing하는 연습을 할 것입니다. 이 실습의 목표는 다음과 같습니다:\n\nQuantization의 기본 개념을 이해합니다.\nk-means quantization을 구현하고 적용합니다.\nk-means quantization에 대해 quantization-aware training을 구현하고 적용합니다.\nlinear quantization을 구현하고 적용합니다.\nlinear quantization에 대해 integer-only inference를 구현하고 적용합니다.\nQuantization에서의 성능 개선(예: 속도 향상)에 대한 기본적인 이해를 얻습니다.\n이러한 quantization 접근 방식 사이의 차이점과 트레이드오프를 이해합니다." }, { "objectID": "posts/labs/lab02.html#contents", "href": "posts/labs/lab02.html#contents", "title": "👩‍💻 Lab 2", "section": "Contents", - "text": "Contents\n주요 섹션은 K-Means Quantization 과 Linear Quantization 2가지로 구성되어 있습니다.\n이번 실습 노트에서 총 10개의 질문을 통해 학습하게 됩니다.: - K-Means Quantization에 대해서는 3개의 질문이 있습니다 (질문 1-3). - Linear Quantization에 대해서는 6개의 질문이 있습니다 (질문 4-9). - 질문 10은 k-means quantization과 linear quantization을 비교합니다.\n\n실습노트에 대한 설정 부분(Setup)은 Colaboratory Note를 열면 확인하실 수 있습니다. 포스팅에서는 보다 실습내용에 집중할 수 있도록 생략되어 있습니다.\n\n\n먼저 FP32 Model의 정확도와 모델 크기를 평가해봅시다\n\nfp32_model_accuracy = evaluate(model, dataloader['test'])\nfp32_model_size = get_model_size(model)\nprint(f\"fp32 model has accuracy={fp32_model_accuracy:.2f}%\")\nprint(f\"fp32 model has size={fp32_model_size/MiB:.2f} MiB\")\n\n\n\n\nfp32 model has accuracy=92.95%\nfp32 model has size=35.20 MiB" + "text": "Contents\n주요 섹션은 K-Means Quantization 과 Linear Quantization 2가지로 구성되어 있습니다.\n이번 실습 노트에서 총 10개의 질문을 통해 학습하게 됩니다.:\n\nK-Means Quantization에 대해서는 3개의 질문이 있습니다 (Question 1-3).\nLinear Quantization에 대해서는 6개의 질문이 있습니다 (Question 4-9).\nQuestion 10은 k-means quantization과 linear quantization을 비교합니다.\n\n\n실습노트에 대한 설정 부분(Setup)은 Colaboratory Note를 열면 확인하실 수 있습니다. 
포스팅에서는 보다 실습내용에 집중할 수 있도록 생략되어 있습니다.\n\n\n먼저 FP32 Model의 정확도와 모델 크기를 평가해봅시다\n\nfp32_model_accuracy = evaluate(model, dataloader['test'])\nfp32_model_size = get_model_size(model)\nprint(f\"fp32 model has accuracy={fp32_model_accuracy:.2f}%\")\nprint(f\"fp32 model has size={fp32_model_size/MiB:.2f} MiB\")\n\n\n\n\nfp32 model has accuracy=92.95%\nfp32 model has size=35.20 MiB" }, { "objectID": "posts/labs/lab02.html#question-1-10-pts", @@ -151,14 +151,14 @@ "href": "posts/labs/lab02.html#quantized-inference", "title": "👩‍💻 Lab 2", "section": "Quantized Inference", - "text": "Quantized Inference\n양자화 후, convolution 및 fully-connected layer의 추론도 변경됩니다.\n\\(r = S(q-Z)\\)를 상기해 보면, 다음과 같습니다.\n\n\\(r_{\\mathrm{input}} = S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}})\\)\n\\(r_{\\mathrm{weight}} = S_{\\mathrm{weight}}(q_{\\mathrm{weight}}-Z_{\\mathrm{weight}})\\)\n\\(r_{\\mathrm{bias}} = S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\)\n\n\\(Z_{\\mathrm{weight}}=0\\)이므로, \\(r_{\\mathrm{weight}} = S_{\\mathrm{weight}}q_{\\mathrm{weight}}\\)입니다.\n부동 소수점 convolution은 다음과 같이 작성할 수 있습니다.\n\n\\(r_{\\mathrm{output}} = \\mathrm{CONV}[r_{\\mathrm{input}}, r_{\\mathrm{weight}}] + r_{\\mathrm{bias}}\\\\\n\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}}), S_{\\mathrm{weight}}q_{\\mathrm{weight}}] + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\\\\n\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}) + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\)\n\n계산을 더 간단하게 하기 위해\n\n\\(Z_{\\mathrm{bias}} = 0\\)\n\\(S_{\\mathrm{bias}} = S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}\\)\n\n로 설정하여,\n\n\\(r_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}})\\) \\(\\;\\;\\;\\;\\;\\;\\;\\;= (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}})\\)\n\n이며,\n\n\\(r_{\\mathrm{output}} = S_{\\mathrm{output}}(q_{\\mathrm{output}}-Z_{\\mathrm{output}})\\)\n\n이므로\n\n\\(S_{\\mathrm{output}}(q_{\\mathrm{output}}-Z_{\\mathrm{output}}) = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} S_{\\mathrm{weight}})\\)\n\n따라서\n\n\\(q_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\\(Z_{\\mathrm{input}}\\), \\(q_{\\mathrm{weight}}\\), \\(q_{\\mathrm{bias}}\\)는 추론 전에 결정되므로,\n\n\\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)\n\n로 설정하면,\n\n\\(q_{\\mathrm{output}} = (\\mathrm{Linear}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\nQuestion 6 (5 pts)\nbias를 linear quantizing하는 함수를 완성하세요.\nHint:\n위의 추론과정에서 아래와 같은 수식을 얻었습니다.\n\n\\(Z_{\\mathrm{bias}} = 0\\)\n\\(S_{\\mathrm{bias}} = S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}\\)\n\n\ndef linear_quantize_bias_per_output_channel(bias, 
weight_scale, input_scale):\n \"\"\"\n linear quantization for single bias tensor\n quantized_bias = fp_bias / bias_scale\n :param bias: [torch.FloatTensor] bias weight to be quantized\n :param weight_scale: [float or torch.FloatTensor] weight scale tensor\n :param input_scale: [float] input scale\n :return:\n [torch.IntTensor] quantized bias tensor\n \"\"\"\n assert(bias.dim() == 1)\n assert(bias.dtype == torch.float)\n assert(isinstance(input_scale, float))\n if isinstance(weight_scale, torch.Tensor):\n assert(weight_scale.dtype == torch.float)\n weight_scale = weight_scale.view(-1)\n assert(bias.numel() == weight_scale.numel())\n\n ############### YOUR CODE STARTS HERE ###############\n bias_scale = weight_scale * input_scale\n ############### YOUR CODE ENDS HERE #################\n\n quantized_bias = linear_quantize(bias, 32, bias_scale,\n zero_point=0, dtype=torch.int32)\n return quantized_bias, bias_scale, 0\n\n\n\nQuantized Fully-Connected Layer\n양자화된 fully-connected layer의 경우, \\(Q_{\\mathrm{bias}}\\)를 먼저 계산합니다. \\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{Linear}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)를 기억하세요.\n\ndef shift_quantized_linear_bias(quantized_bias, quantized_weight, input_zero_point):\n \"\"\"\n shift quantized bias to incorporate input_zero_point for nn.Linear\n shifted_quantized_bias = quantized_bias - Linear(input_zero_point, quantized_weight)\n :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)\n :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)\n :param input_zero_point: [int] input zero point\n :return:\n [torch.IntTensor] shifted quantized bias tensor\n \"\"\"\n assert(quantized_bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n return quantized_bias - quantized_weight.sum(1).to(torch.int32) * input_zero_point\n\n\nQuestion 7 (15 pts)\n아래의 양자화된 fully-connected layer inference function를 완성하세요.\nHint:\n\n\\(q_{\\mathrm{output}} = (\\mathrm{Linear}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\ndef quantized_linear(input, weight, bias, feature_bitwidth, weight_bitwidth,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale):\n \"\"\"\n quantized fully-connected layer\n :param input: [torch.CharTensor] quantized input (torch.int8)\n :param weight: [torch.CharTensor] quantized weight (torch.int8)\n :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)\n :param feature_bitwidth: [int] quantization bit width of input and output\n :param weight_bitwidth: [int] quantization bit width of weight\n :param input_zero_point: [int] input zero point\n :param output_zero_point: [int] output zero point\n :param input_scale: [float] input feature scale\n :param weight_scale: [torch.FloatTensor] weight per-channel scale\n :param output_scale: [float] output feature scale\n :return:\n [torch.CharIntTensor] quantized output feature (torch.int8)\n \"\"\"\n assert(input.dtype == torch.int8)\n assert(weight.dtype == input.dtype)\n assert(bias is None or bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n assert(isinstance(output_zero_point, int))\n assert(isinstance(input_scale, float))\n assert(isinstance(output_scale, float))\n assert(weight_scale.dtype == torch.float)\n\n # Step 1: integer-based fully-connected (8-bit multiplication with 32-bit accumulation)\n if 'cpu' in input.device.type:\n # use 32-b MAC for 
simplicity\n output = torch.nn.functional.linear(input.to(torch.int32), weight.to(torch.int32), bias)\n else:\n # current version pytorch does not yet support integer-based linear() on GPUs\n output = torch.nn.functional.linear(input.float(), weight.float(), bias.float())\n\n ############### YOUR CODE STARTS HERE ###############\n # Step 2: scale the output\n # hint: 1. scales are floating numbers, we need to convert output to float as well\n # 2. the shape of weight scale is [oc, 1, 1, 1] while the shape of output is [batch_size, oc]\n real_scale = input_scale * weight_scale.view(-1) / output_scale\n output = output.float() * real_scale\n\n # Step 3: Shift output by output_zero_point\n output += output_zero_point\n ############### YOUR CODE STARTS HERE ###############\n\n # Make sure all value lies in the bitwidth-bit range\n output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)\n return output\n\nLet’s verify the functionality of defined quantized fully connected layer.\n\ntest_quantized_fc()\n\n* Test quantized_fc()\n target bitwidth: 2 bits\n batch size: 4\n input channels: 8\n output channels: 8\n* Test passed.\n\n\n\n\n\n\n\n\n\n\n\n\nQuantized Convolution\n양자화된 컨볼루션 레이어의 경우, 먼저 \\(Q_{\\mathrm{bias}}\\)를 계산합니다. \\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)를 기억하세요.\n\ndef shift_quantized_conv2d_bias(quantized_bias, quantized_weight, input_zero_point):\n \"\"\"\n shift quantized bias to incorporate input_zero_point for nn.Conv2d\n shifted_quantized_bias = quantized_bias - Conv(input_zero_point, quantized_weight)\n :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)\n :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)\n :param input_zero_point: [int] input zero point\n :return:\n [torch.IntTensor] shifted quantized bias tensor\n \"\"\"\n assert(quantized_bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n return quantized_bias - quantized_weight.sum((1,2,3)).to(torch.int32) * input_zero_point\n\n\nQuestion 8 (15 pts)\n아래의 quantized convolution function을 완성하세요.\nHint: > \\(q_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}}) \\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\ndef quantized_conv2d(input, weight, bias, feature_bitwidth, weight_bitwidth,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n stride, padding, dilation, groups):\n \"\"\"\n quantized 2d convolution\n :param input: [torch.CharTensor] quantized input (torch.int8)\n :param weight: [torch.CharTensor] quantized weight (torch.int8)\n :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)\n :param feature_bitwidth: [int] quantization bit width of input and output\n :param weight_bitwidth: [int] quantization bit width of weight\n :param input_zero_point: [int] input zero point\n :param output_zero_point: [int] output zero point\n :param input_scale: [float] input feature scale\n :param weight_scale: [torch.FloatTensor] weight per-channel scale\n :param output_scale: [float] output feature scale\n :return:\n [torch.(cuda.)CharTensor] quantized output feature\n \"\"\"\n assert(len(padding) == 4)\n assert(input.dtype == torch.int8)\n assert(weight.dtype == input.dtype)\n assert(bias is None or bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n assert(isinstance(output_zero_point, int))\n 
assert(isinstance(input_scale, float))\n assert(isinstance(output_scale, float))\n assert(weight_scale.dtype == torch.float)\n\n # Step 1: calculate integer-based 2d convolution (8-bit multiplication with 32-bit accumulation)\n input = torch.nn.functional.pad(input, padding, 'constant', input_zero_point)\n if 'cpu' in input.device.type:\n # use 32-b MAC for simplicity\n output = torch.nn.functional.conv2d(input.to(torch.int32), weight.to(torch.int32), None, stride, 0, dilation, groups)\n else:\n # current version pytorch does not yet support integer-based conv2d() on GPUs\n output = torch.nn.functional.conv2d(input.float(), weight.float(), None, stride, 0, dilation, groups)\n output = output.round().to(torch.int32)\n if bias is not None:\n output = output + bias.view(1, -1, 1, 1)\n\n ############### YOUR CODE STARTS HERE ###############\n # hint: this code block should be the very similar to quantized_linear()\n\n # Step 2: scale the output\n # hint: 1. scales are floating numbers, we need to convert output to float as well\n # 2. the shape of weight scale is [oc, 1, 1, 1] while the shape of output is [batch_size, oc, height, width]\n real_scale = input_scale * weight_scale.view(-1) / output_scale\n output = output.float() * real_scale.unsqueeze(1).unsqueeze(2)\n\n # Step 3: shift output by output_zero_point\n # hint: one line of code\n output += output_zero_point\n ############### YOUR CODE STARTS HERE ###############\n\n # Make sure all value lies in the bitwidth-bit range\n output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)\n return output" + "text": "Quantized Inference\n양자화 후, convolution 및 fully-connected layer의 추론도 변경됩니다.\n\\(r = S(q-Z)\\)를 상기해 보면, 다음과 같습니다.\n\n\\(r_{\\mathrm{input}} = S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}})\\)\n\\(r_{\\mathrm{weight}} = S_{\\mathrm{weight}}(q_{\\mathrm{weight}}-Z_{\\mathrm{weight}})\\)\n\\(r_{\\mathrm{bias}} = S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\)\n\n\\(Z_{\\mathrm{weight}}=0\\)이므로, \\(r_{\\mathrm{weight}} = S_{\\mathrm{weight}}q_{\\mathrm{weight}}\\)입니다.\n부동 소수점 convolution은 다음과 같이 작성할 수 있습니다.\n\n\\(r_{\\mathrm{output}} = \\mathrm{CONV}[r_{\\mathrm{input}}, r_{\\mathrm{weight}}] + r_{\\mathrm{bias}}\\)\\ \\(\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}}), S_{\\mathrm{weight}}q_{\\mathrm{weight}}] + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\)\\ \\(\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}) + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\)\n\n계산을 더 간단하게 하기 위해\n\n\\(Z_{\\mathrm{bias}} = 0\\)\n\\(S_{\\mathrm{bias}} = S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}\\)\n\n로 설정하여,\n\n\\(r_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}})\\) \\ \\(\\;\\;\\;\\;\\;\\;\\;\\;\\;= (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}})\\)\n\n이며,\n\n\\(r_{\\mathrm{output}} = S_{\\mathrm{output}}(q_{\\mathrm{output}}-Z_{\\mathrm{output}})\\)\n\n이므로\n\n\\(S_{\\mathrm{output}}(q_{\\mathrm{output}}-Z_{\\mathrm{output}}) = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, 
q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} S_{\\mathrm{weight}})\\)\n\n따라서\n\n\\(q_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\\(Z_{\\mathrm{input}}\\), \\(q_{\\mathrm{weight}}\\), \\(q_{\\mathrm{bias}}\\)는 추론 전에 결정되므로,\n\n\\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)\n\n로 설정하면,\n\n\\(q_{\\mathrm{output}} = (\\mathrm{Linear}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\nQuestion 6 (5 pts)\nbias를 linear quantizing하는 함수를 완성하세요.\nHint:\n위의 추론과정에서 아래와 같은 수식을 얻었습니다.\n\n\\(Z_{\\mathrm{bias}} = 0\\)\n\\(S_{\\mathrm{bias}} = S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}\\)\n\n\ndef linear_quantize_bias_per_output_channel(bias, weight_scale, input_scale):\n \"\"\"\n linear quantization for single bias tensor\n quantized_bias = fp_bias / bias_scale\n :param bias: [torch.FloatTensor] bias weight to be quantized\n :param weight_scale: [float or torch.FloatTensor] weight scale tensor\n :param input_scale: [float] input scale\n :return:\n [torch.IntTensor] quantized bias tensor\n \"\"\"\n assert(bias.dim() == 1)\n assert(bias.dtype == torch.float)\n assert(isinstance(input_scale, float))\n if isinstance(weight_scale, torch.Tensor):\n assert(weight_scale.dtype == torch.float)\n weight_scale = weight_scale.view(-1)\n assert(bias.numel() == weight_scale.numel())\n\n ############### YOUR CODE STARTS HERE ###############\n bias_scale = weight_scale * input_scale\n ############### YOUR CODE ENDS HERE #################\n\n quantized_bias = linear_quantize(bias, 32, bias_scale,\n zero_point=0, dtype=torch.int32)\n return quantized_bias, bias_scale, 0\n\n\n\nQuantized Fully-Connected Layer\n양자화된 fully-connected layer의 경우, \\(Q_{\\mathrm{bias}}\\)를 먼저 계산합니다. 
\\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{Linear}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)를 기억하세요.\n\ndef shift_quantized_linear_bias(quantized_bias, quantized_weight, input_zero_point):\n \"\"\"\n shift quantized bias to incorporate input_zero_point for nn.Linear\n shifted_quantized_bias = quantized_bias - Linear(input_zero_point, quantized_weight)\n :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)\n :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)\n :param input_zero_point: [int] input zero point\n :return:\n [torch.IntTensor] shifted quantized bias tensor\n \"\"\"\n assert(quantized_bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n return quantized_bias - quantized_weight.sum(1).to(torch.int32) * input_zero_point\n\n\nQuestion 7 (15 pts)\n아래의 양자화된 fully-connected layer inference function를 완성하세요.\nHint:\n\n\\(q_{\\mathrm{output}} = (\\mathrm{Linear}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\n\ndef quantized_linear(input, weight, bias, feature_bitwidth, weight_bitwidth,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale):\n \"\"\"\n quantized fully-connected layer\n :param input: [torch.CharTensor] quantized input (torch.int8)\n :param weight: [torch.CharTensor] quantized weight (torch.int8)\n :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)\n :param feature_bitwidth: [int] quantization bit width of input and output\n :param weight_bitwidth: [int] quantization bit width of weight\n :param input_zero_point: [int] input zero point\n :param output_zero_point: [int] output zero point\n :param input_scale: [float] input feature scale\n :param weight_scale: [torch.FloatTensor] weight per-channel scale\n :param output_scale: [float] output feature scale\n :return:\n [torch.CharIntTensor] quantized output feature (torch.int8)\n \"\"\"\n assert(input.dtype == torch.int8)\n assert(weight.dtype == input.dtype)\n assert(bias is None or bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n assert(isinstance(output_zero_point, int))\n assert(isinstance(input_scale, float))\n assert(isinstance(output_scale, float))\n assert(weight_scale.dtype == torch.float)\n\n # Step 1: integer-based fully-connected (8-bit multiplication with 32-bit accumulation)\n if 'cpu' in input.device.type:\n # use 32-b MAC for simplicity\n output = torch.nn.functional.linear(input.to(torch.int32), weight.to(torch.int32), bias)\n else:\n # current version pytorch does not yet support integer-based linear() on GPUs\n output = torch.nn.functional.linear(input.float(), weight.float(), bias.float())\n\n ############### YOUR CODE STARTS HERE ###############\n # Step 2: scale the output\n # hint: 1. scales are floating numbers, we need to convert output to float as well\n # 2. 
the shape of weight scale is [oc, 1, 1, 1] while the shape of output is [batch_size, oc]\n real_scale = input_scale * weight_scale.view(-1) / output_scale\n output = output.float() * real_scale\n\n # Step 3: Shift output by output_zero_point\n output += output_zero_point\n ############### YOUR CODE STARTS HERE ###############\n\n # Make sure all value lies in the bitwidth-bit range\n output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)\n return output\n\nLet’s verify the functionality of defined quantized fully connected layer.\n\ntest_quantized_fc()\n\n* Test quantized_fc()\n target bitwidth: 2 bits\n batch size: 4\n input channels: 8\n output channels: 8\n* Test passed.\n\n\n\n\n\n\n\n\n\n\n\n\nQuantized Convolution\n양자화된 컨볼루션 레이어의 경우, 먼저 \\(Q_{\\mathrm{bias}}\\)를 계산합니다. \\(Q_{\\mathrm{bias}} = q_{\\mathrm{bias}} - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\)를 기억하세요.\n\ndef shift_quantized_conv2d_bias(quantized_bias, quantized_weight, input_zero_point):\n \"\"\"\n shift quantized bias to incorporate input_zero_point for nn.Conv2d\n shifted_quantized_bias = quantized_bias - Conv(input_zero_point, quantized_weight)\n :param quantized_bias: [torch.IntTensor] quantized bias (torch.int32)\n :param quantized_weight: [torch.CharTensor] quantized weight (torch.int8)\n :param input_zero_point: [int] input zero point\n :return:\n [torch.IntTensor] shifted quantized bias tensor\n \"\"\"\n assert(quantized_bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n return quantized_bias - quantized_weight.sum((1,2,3)).to(torch.int32) * input_zero_point\n\n\nQuestion 8 (15 pts)\n아래의 quantized convolution function을 완성하세요.\nHint: > \\(q_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] + Q_{\\mathrm{bias}}) \\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}} / S_{\\mathrm{output}}) + Z_{\\mathrm{output}}\\)\n\ndef quantized_conv2d(input, weight, bias, feature_bitwidth, weight_bitwidth,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n stride, padding, dilation, groups):\n \"\"\"\n quantized 2d convolution\n :param input: [torch.CharTensor] quantized input (torch.int8)\n :param weight: [torch.CharTensor] quantized weight (torch.int8)\n :param bias: [torch.IntTensor] shifted quantized bias or None (torch.int32)\n :param feature_bitwidth: [int] quantization bit width of input and output\n :param weight_bitwidth: [int] quantization bit width of weight\n :param input_zero_point: [int] input zero point\n :param output_zero_point: [int] output zero point\n :param input_scale: [float] input feature scale\n :param weight_scale: [torch.FloatTensor] weight per-channel scale\n :param output_scale: [float] output feature scale\n :return:\n [torch.(cuda.)CharTensor] quantized output feature\n \"\"\"\n assert(len(padding) == 4)\n assert(input.dtype == torch.int8)\n assert(weight.dtype == input.dtype)\n assert(bias is None or bias.dtype == torch.int32)\n assert(isinstance(input_zero_point, int))\n assert(isinstance(output_zero_point, int))\n assert(isinstance(input_scale, float))\n assert(isinstance(output_scale, float))\n assert(weight_scale.dtype == torch.float)\n\n # Step 1: calculate integer-based 2d convolution (8-bit multiplication with 32-bit accumulation)\n input = torch.nn.functional.pad(input, padding, 'constant', input_zero_point)\n if 'cpu' in input.device.type:\n # use 32-b MAC for simplicity\n output = torch.nn.functional.conv2d(input.to(torch.int32), weight.to(torch.int32), None, 
stride, 0, dilation, groups)\n else:\n # current version pytorch does not yet support integer-based conv2d() on GPUs\n output = torch.nn.functional.conv2d(input.float(), weight.float(), None, stride, 0, dilation, groups)\n output = output.round().to(torch.int32)\n if bias is not None:\n output = output + bias.view(1, -1, 1, 1)\n\n ############### YOUR CODE STARTS HERE ###############\n # hint: this code block should be the very similar to quantized_linear()\n\n # Step 2: scale the output\n # hint: 1. scales are floating numbers, we need to convert output to float as well\n # 2. the shape of weight scale is [oc, 1, 1, 1] while the shape of output is [batch_size, oc, height, width]\n real_scale = input_scale * weight_scale.view(-1) / output_scale\n output = output.float() * real_scale.unsqueeze(1).unsqueeze(2)\n\n # Step 3: shift output by output_zero_point\n # hint: one line of code\n output += output_zero_point\n ############### YOUR CODE STARTS HERE ###############\n\n # Make sure all value lies in the bitwidth-bit range\n output = output.round().clamp(*get_quantized_range(feature_bitwidth)).to(torch.int8)\n return output" }, { "objectID": "posts/labs/lab02.html#question-9-10-pts", "href": "posts/labs/lab02.html#question-9-10-pts", "title": "👩‍💻 Lab 2", "section": "Question 9 (10 pts)", - "text": "Question 9 (10 pts)\n마지막으로 모든 것을 종합하여 모델에 대한 훈련 후 int8 양자화를 수행합니다. 모델의 컨볼루션 레이어와 선형 레이어를 하나씩 양자화된 버전으로 변환합니다.\n\n먼저, BatchNorm 계층을 이전 convolutional layer에 융합할 것이며, 이는 양자화 전에 하는 표준 관행입니다. BatchNorm을 융합하면 추론 중에 추가 곱셈이 줄어듭니다.\n\n융합 모델인 model_fused가 원래 모델과 동일한 정확도를 갖는지도 검증할 예정입니다(BN fusion은 네트워크 기능을 변경하지 않는 동등한 변환입니다).\n\ndef fuse_conv_bn(conv, bn):\n # modified from https://mmcv.readthedocs.io/en/latest/_modules/mmcv/cnn/utils/fuse_conv_bn.html\n assert conv.bias is None\n\n factor = bn.weight.data / torch.sqrt(bn.running_var.data + bn.eps)\n conv.weight.data = conv.weight.data * factor.reshape(-1, 1, 1, 1)\n conv.bias = nn.Parameter(- bn.running_mean.data * factor + bn.bias.data)\n\n return conv\n\nprint('Before conv-bn fusion: backbone length', len(model.backbone))\n# fuse the batchnorm into conv layers\nrecover_model()\nmodel_fused = copy.deepcopy(model)\nfused_backbone = []\nptr = 0\nwhile ptr < len(model_fused.backbone):\n if isinstance(model_fused.backbone[ptr], nn.Conv2d) and \\\n isinstance(model_fused.backbone[ptr + 1], nn.BatchNorm2d):\n fused_backbone.append(fuse_conv_bn(\n model_fused.backbone[ptr], model_fused.backbone[ptr+ 1]))\n ptr += 2\n else:\n fused_backbone.append(model_fused.backbone[ptr])\n ptr += 1\nmodel_fused.backbone = nn.Sequential(*fused_backbone)\n\nprint('After conv-bn fusion: backbone length', len(model_fused.backbone))\n# sanity check, no BN anymore\nfor m in model_fused.modules():\n assert not isinstance(m, nn.BatchNorm2d)\n\n# the accuracy will remain the same after fusion\nfused_acc = evaluate(model_fused, dataloader['test'])\nprint(f'Accuracy of the fused model={fused_acc:.2f}%')\n\nBefore conv-bn fusion: backbone length 29\nAfter conv-bn fusion: backbone length 21\nAccuracy of the fused model=92.95%\n\n\n\n\n\n\n각 특징 맵의 범위를 얻기 위해 일부 샘플 데이터로 모델을 실행하여 특징 맵의 범위를 얻고, 해당 스케일링 팩터와 제로 포인트를 계산할 수 있습니다.\n\n\n# add hook to record the min max value of the activation\ninput_activation = {}\noutput_activation = {}\n\ndef add_range_recoder_hook(model):\n import functools\n def _record_range(self, x, y, module_name):\n x = x[0]\n input_activation[module_name] = x.detach()\n output_activation[module_name] = y.detach()\n\n all_hooks = []\n for name, m in model.named_modules():\n 
if isinstance(m, (nn.Conv2d, nn.Linear, nn.ReLU)):\n all_hooks.append(m.register_forward_hook(\n functools.partial(_record_range, module_name=name)))\n return all_hooks\n\nhooks = add_range_recoder_hook(model_fused)\nsample_data = iter(dataloader['train']).__next__()[0]\nmodel_fused(sample_data.cuda())\n\n# remove hooks\nfor h in hooks:\n h.remove()\n\n\n마지막으로 모델 양자화를 해보겠습니다. 다음과 같은 매핑으로 모델을 변환합니다.\n\nnn.Conv2d: QuantizedConv2d,\nnn.Linear: QuantizedLinear,\n# the following twos are just wrappers, as current\n# torch modules do not support int8 data format;\n# we will temporarily convert them to fp32 for computation\nnn.MaxPool2d: QuantizedMaxPool2d,\nnn.AvgPool2d: QuantizedAvgPool2d,\n\nclass QuantizedConv2d(nn.Module):\n def __init__(self, weight, bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n stride, padding, dilation, groups,\n feature_bitwidth=8, weight_bitwidth=8):\n super().__init__()\n # current version Pytorch does not support IntTensor as nn.Parameter\n self.register_buffer('weight', weight)\n self.register_buffer('bias', bias)\n\n self.input_zero_point = input_zero_point\n self.output_zero_point = output_zero_point\n\n self.input_scale = input_scale\n self.register_buffer('weight_scale', weight_scale)\n self.output_scale = output_scale\n\n self.stride = stride\n self.padding = (padding[1], padding[1], padding[0], padding[0])\n self.dilation = dilation\n self.groups = groups\n\n self.feature_bitwidth = feature_bitwidth\n self.weight_bitwidth = weight_bitwidth\n\n\n def forward(self, x):\n return quantized_conv2d(\n x, self.weight, self.bias,\n self.feature_bitwidth, self.weight_bitwidth,\n self.input_zero_point, self.output_zero_point,\n self.input_scale, self.weight_scale, self.output_scale,\n self.stride, self.padding, self.dilation, self.groups\n )\n\nclass QuantizedLinear(nn.Module):\n def __init__(self, weight, bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n feature_bitwidth=8, weight_bitwidth=8):\n super().__init__()\n # current version Pytorch does not support IntTensor as nn.Parameter\n self.register_buffer('weight', weight)\n self.register_buffer('bias', bias)\n\n self.input_zero_point = input_zero_point\n self.output_zero_point = output_zero_point\n\n self.input_scale = input_scale\n self.register_buffer('weight_scale', weight_scale)\n self.output_scale = output_scale\n\n self.feature_bitwidth = feature_bitwidth\n self.weight_bitwidth = weight_bitwidth\n\n def forward(self, x):\n return quantized_linear(\n x, self.weight, self.bias,\n self.feature_bitwidth, self.weight_bitwidth,\n self.input_zero_point, self.output_zero_point,\n self.input_scale, self.weight_scale, self.output_scale\n )\n\nclass QuantizedMaxPool2d(nn.MaxPool2d):\n def forward(self, x):\n # current version PyTorch does not support integer-based MaxPool\n return super().forward(x.float()).to(torch.int8)\n\nclass QuantizedAvgPool2d(nn.AvgPool2d):\n def forward(self, x):\n # current version PyTorch does not support integer-based AvgPool\n return super().forward(x.float()).to(torch.int8)\n\n# we use int8 quantization, which is quite popular\nfeature_bitwidth = weight_bitwidth = 8\nquantized_model = copy.deepcopy(model_fused)\nquantized_backbone = []\nptr = 0\nwhile ptr < len(quantized_model.backbone):\n if isinstance(quantized_model.backbone[ptr], nn.Conv2d) and \\\n isinstance(quantized_model.backbone[ptr + 1], nn.ReLU):\n conv = quantized_model.backbone[ptr]\n conv_name = f'backbone.{ptr}'\n relu = 
quantized_model.backbone[ptr + 1]\n relu_name = f'backbone.{ptr + 1}'\n\n input_scale, input_zero_point = \\\n get_quantization_scale_and_zero_point(\n input_activation[conv_name], feature_bitwidth)\n\n output_scale, output_zero_point = \\\n get_quantization_scale_and_zero_point(\n output_activation[relu_name], feature_bitwidth)\n\n quantized_weight, weight_scale, weight_zero_point = \\\n linear_quantize_weight_per_channel(conv.weight.data, weight_bitwidth)\n quantized_bias, bias_scale, bias_zero_point = \\\n linear_quantize_bias_per_output_channel(\n conv.bias.data, weight_scale, input_scale)\n shifted_quantized_bias = \\\n shift_quantized_conv2d_bias(quantized_bias, quantized_weight,\n input_zero_point)\n\n quantized_conv = QuantizedConv2d(\n quantized_weight, shifted_quantized_bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n conv.stride, conv.padding, conv.dilation, conv.groups,\n feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth\n )\n\n quantized_backbone.append(quantized_conv)\n ptr += 2\n elif isinstance(quantized_model.backbone[ptr], nn.MaxPool2d):\n quantized_backbone.append(QuantizedMaxPool2d(\n kernel_size=quantized_model.backbone[ptr].kernel_size,\n stride=quantized_model.backbone[ptr].stride\n ))\n ptr += 1\n elif isinstance(quantized_model.backbone[ptr], nn.AvgPool2d):\n quantized_backbone.append(QuantizedAvgPool2d(\n kernel_size=quantized_model.backbone[ptr].kernel_size,\n stride=quantized_model.backbone[ptr].stride\n ))\n ptr += 1\n else:\n raise NotImplementedError(type(quantized_model.backbone[ptr])) # should not happen\nquantized_model.backbone = nn.Sequential(*quantized_backbone)\n\n# finally, quantized the classifier\nfc_name = 'classifier'\nfc = model.classifier\ninput_scale, input_zero_point = \\\n get_quantization_scale_and_zero_point(\n input_activation[fc_name], feature_bitwidth)\n\noutput_scale, output_zero_point = \\\n get_quantization_scale_and_zero_point(\n output_activation[fc_name], feature_bitwidth)\n\nquantized_weight, weight_scale, weight_zero_point = \\\n linear_quantize_weight_per_channel(fc.weight.data, weight_bitwidth)\nquantized_bias, bias_scale, bias_zero_point = \\\n linear_quantize_bias_per_output_channel(\n fc.bias.data, weight_scale, input_scale)\nshifted_quantized_bias = \\\n shift_quantized_linear_bias(quantized_bias, quantized_weight,\n input_zero_point)\n\nquantized_model.classifier = QuantizedLinear(\n quantized_weight, shifted_quantized_bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth\n)\n\n양자화 과정이 완료되었습니다! 모델 아키텍처를 인쇄하고 시각화하며 양자화된 모델의 정확성도 검증해 보겠습니다.\n\nQuestion 9.1 (5 pts)\n양자화된 모델을 실행하기 위해서는 (0, 1) 범위의 입력 데이터를 (-128, 127) 범위의 int8 범위로 매핑하는 추가적인 전처리가 필요합니다. 
이 전처리를 진행하는 아래 코드를 완성하세요.\nHint: 양자화된 모델은 fp32 모델과 거의 동일한 정확도를 가지고 있습니다.\n\nprint(quantized_model)\n\ndef extra_preprocess(x):\n # hint: you need to convert the original fp32 input of range (0, 1)\n # into int8 format of range (-128, 127)\n ############### YOUR CODE STARTS HERE ###############\n return x.clamp(-128, 127).to(torch.int8)\n ############### YOUR CODE ENDS HERE #################\n\nint8_model_accuracy = evaluate(quantized_model, dataloader['test'],\n extra_preprocess=[extra_preprocess])\nprint(f\"int8 model has accuracy={int8_model_accuracy:.2f}%\")\n\nVGG(\n (backbone): Sequential(\n (0): QuantizedConv2d()\n (1): QuantizedConv2d()\n (2): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (3): QuantizedConv2d()\n (4): QuantizedConv2d()\n (5): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (6): QuantizedConv2d()\n (7): QuantizedConv2d()\n (8): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (9): QuantizedConv2d()\n (10): QuantizedConv2d()\n (11): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (12): QuantizedAvgPool2d(kernel_size=2, stride=2, padding=0)\n )\n (classifier): QuantizedLinear()\n)\nint8 model has accuracy=10.00%" + "text": "Question 9 (10 pts)\n마지막으로 모든 것을 종합하여 모델에 대한 훈련 후 int8 양자화를 수행합니다. 모델의 컨볼루션 레이어와 선형 레이어를 하나씩 양자화된 버전으로 변환합니다.\n\n먼저, BatchNorm 계층을 이전 convolutional layer에 융합할 것이며, 이는 양자화 전에 하는 표준 관행입니다. BatchNorm을 융합하면 추론 중에 추가 곱셈이 줄어듭니다.\n\n융합 모델인 model_fused가 원래 모델과 동일한 정확도를 갖는지도 검증할 예정입니다(BN fusion은 네트워크 기능을 변경하지 않는 동등한 변환입니다).\n\ndef fuse_conv_bn(conv, bn):\n # modified from https://mmcv.readthedocs.io/en/latest/_modules/mmcv/cnn/utils/fuse_conv_bn.html\n assert conv.bias is None\n\n factor = bn.weight.data / torch.sqrt(bn.running_var.data + bn.eps)\n conv.weight.data = conv.weight.data * factor.reshape(-1, 1, 1, 1)\n conv.bias = nn.Parameter(- bn.running_mean.data * factor + bn.bias.data)\n\n return conv\n\nprint('Before conv-bn fusion: backbone length', len(model.backbone))\n# fuse the batchnorm into conv layers\nrecover_model()\nmodel_fused = copy.deepcopy(model)\nfused_backbone = []\nptr = 0\nwhile ptr < len(model_fused.backbone):\n if isinstance(model_fused.backbone[ptr], nn.Conv2d) and \\\n isinstance(model_fused.backbone[ptr + 1], nn.BatchNorm2d):\n fused_backbone.append(fuse_conv_bn(\n model_fused.backbone[ptr], model_fused.backbone[ptr+ 1]))\n ptr += 2\n else:\n fused_backbone.append(model_fused.backbone[ptr])\n ptr += 1\nmodel_fused.backbone = nn.Sequential(*fused_backbone)\n\nprint('After conv-bn fusion: backbone length', len(model_fused.backbone))\n# sanity check, no BN anymore\nfor m in model_fused.modules():\n assert not isinstance(m, nn.BatchNorm2d)\n\n# the accuracy will remain the same after fusion\nfused_acc = evaluate(model_fused, dataloader['test'])\nprint(f'Accuracy of the fused model={fused_acc:.2f}%')\n\nBefore conv-bn fusion: backbone length 29\nAfter conv-bn fusion: backbone length 21\nAccuracy of the fused model=92.95%\n\n\n\n\n\n\n각 특징 맵의 범위를 얻기 위해 일부 샘플 데이터로 모델을 실행하여 특징 맵의 범위를 얻고, 해당 스케일링 팩터와 제로 포인트를 계산할 수 있습니다.\n\n\n# add hook to record the min max value of the activation\ninput_activation = {}\noutput_activation = {}\n\ndef add_range_recoder_hook(model):\n import functools\n def _record_range(self, x, y, module_name):\n x = x[0]\n input_activation[module_name] = x.detach()\n output_activation[module_name] = y.detach()\n\n all_hooks = []\n for name, m in model.named_modules():\n if 
isinstance(m, (nn.Conv2d, nn.Linear, nn.ReLU)):\n all_hooks.append(m.register_forward_hook(\n functools.partial(_record_range, module_name=name)))\n return all_hooks\n\nhooks = add_range_recoder_hook(model_fused)\nsample_data = iter(dataloader['train']).__next__()[0]\nmodel_fused(sample_data.cuda())\n\n# remove hooks\nfor h in hooks:\n h.remove()\n\n\n마지막으로 모델 양자화를 해보겠습니다. 다음과 같은 매핑으로 모델을 변환합니다.\n\nnn.Conv2d: QuantizedConv2d,\nnn.Linear: QuantizedLinear,\n# the following twos are just wrappers, as current\n# torch modules do not support int8 data format;\n# we will temporarily convert them to fp32 for computation\nnn.MaxPool2d: QuantizedMaxPool2d,\nnn.AvgPool2d: QuantizedAvgPool2d,\n\nclass QuantizedConv2d(nn.Module):\n def __init__(self, weight, bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n stride, padding, dilation, groups,\n feature_bitwidth=8, weight_bitwidth=8):\n super().__init__()\n # current version Pytorch does not support IntTensor as nn.Parameter\n self.register_buffer('weight', weight)\n self.register_buffer('bias', bias)\n\n self.input_zero_point = input_zero_point\n self.output_zero_point = output_zero_point\n\n self.input_scale = input_scale\n self.register_buffer('weight_scale', weight_scale)\n self.output_scale = output_scale\n\n self.stride = stride\n self.padding = (padding[1], padding[1], padding[0], padding[0])\n self.dilation = dilation\n self.groups = groups\n\n self.feature_bitwidth = feature_bitwidth\n self.weight_bitwidth = weight_bitwidth\n\n\n def forward(self, x):\n return quantized_conv2d(\n x, self.weight, self.bias,\n self.feature_bitwidth, self.weight_bitwidth,\n self.input_zero_point, self.output_zero_point,\n self.input_scale, self.weight_scale, self.output_scale,\n self.stride, self.padding, self.dilation, self.groups\n )\n\nclass QuantizedLinear(nn.Module):\n def __init__(self, weight, bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n feature_bitwidth=8, weight_bitwidth=8):\n super().__init__()\n # current version Pytorch does not support IntTensor as nn.Parameter\n self.register_buffer('weight', weight)\n self.register_buffer('bias', bias)\n\n self.input_zero_point = input_zero_point\n self.output_zero_point = output_zero_point\n\n self.input_scale = input_scale\n self.register_buffer('weight_scale', weight_scale)\n self.output_scale = output_scale\n\n self.feature_bitwidth = feature_bitwidth\n self.weight_bitwidth = weight_bitwidth\n\n def forward(self, x):\n return quantized_linear(\n x, self.weight, self.bias,\n self.feature_bitwidth, self.weight_bitwidth,\n self.input_zero_point, self.output_zero_point,\n self.input_scale, self.weight_scale, self.output_scale\n )\n\nclass QuantizedMaxPool2d(nn.MaxPool2d):\n def forward(self, x):\n # current version PyTorch does not support integer-based MaxPool\n return super().forward(x.float()).to(torch.int8)\n\nclass QuantizedAvgPool2d(nn.AvgPool2d):\n def forward(self, x):\n # current version PyTorch does not support integer-based AvgPool\n return super().forward(x.float()).to(torch.int8)\n\n# we use int8 quantization, which is quite popular\nfeature_bitwidth = weight_bitwidth = 8\nquantized_model = copy.deepcopy(model_fused)\nquantized_backbone = []\nptr = 0\nwhile ptr < len(quantized_model.backbone):\n if isinstance(quantized_model.backbone[ptr], nn.Conv2d) and \\\n isinstance(quantized_model.backbone[ptr + 1], nn.ReLU):\n conv = quantized_model.backbone[ptr]\n conv_name = f'backbone.{ptr}'\n relu = 
quantized_model.backbone[ptr + 1]\n relu_name = f'backbone.{ptr + 1}'\n\n input_scale, input_zero_point = \\\n get_quantization_scale_and_zero_point(\n input_activation[conv_name], feature_bitwidth)\n\n output_scale, output_zero_point = \\\n get_quantization_scale_and_zero_point(\n output_activation[relu_name], feature_bitwidth)\n\n quantized_weight, weight_scale, weight_zero_point = \\\n linear_quantize_weight_per_channel(conv.weight.data, weight_bitwidth)\n quantized_bias, bias_scale, bias_zero_point = \\\n linear_quantize_bias_per_output_channel(\n conv.bias.data, weight_scale, input_scale)\n shifted_quantized_bias = \\\n shift_quantized_conv2d_bias(quantized_bias, quantized_weight,\n input_zero_point)\n\n quantized_conv = QuantizedConv2d(\n quantized_weight, shifted_quantized_bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n conv.stride, conv.padding, conv.dilation, conv.groups,\n feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth\n )\n\n quantized_backbone.append(quantized_conv)\n ptr += 2\n elif isinstance(quantized_model.backbone[ptr], nn.MaxPool2d):\n quantized_backbone.append(QuantizedMaxPool2d(\n kernel_size=quantized_model.backbone[ptr].kernel_size,\n stride=quantized_model.backbone[ptr].stride\n ))\n ptr += 1\n elif isinstance(quantized_model.backbone[ptr], nn.AvgPool2d):\n quantized_backbone.append(QuantizedAvgPool2d(\n kernel_size=quantized_model.backbone[ptr].kernel_size,\n stride=quantized_model.backbone[ptr].stride\n ))\n ptr += 1\n else:\n raise NotImplementedError(type(quantized_model.backbone[ptr])) # should not happen\nquantized_model.backbone = nn.Sequential(*quantized_backbone)\n\n# finally, quantized the classifier\nfc_name = 'classifier'\nfc = model.classifier\ninput_scale, input_zero_point = \\\n get_quantization_scale_and_zero_point(\n input_activation[fc_name], feature_bitwidth)\n\noutput_scale, output_zero_point = \\\n get_quantization_scale_and_zero_point(\n output_activation[fc_name], feature_bitwidth)\n\nquantized_weight, weight_scale, weight_zero_point = \\\n linear_quantize_weight_per_channel(fc.weight.data, weight_bitwidth)\nquantized_bias, bias_scale, bias_zero_point = \\\n linear_quantize_bias_per_output_channel(\n fc.bias.data, weight_scale, input_scale)\nshifted_quantized_bias = \\\n shift_quantized_linear_bias(quantized_bias, quantized_weight,\n input_zero_point)\n\nquantized_model.classifier = QuantizedLinear(\n quantized_weight, shifted_quantized_bias,\n input_zero_point, output_zero_point,\n input_scale, weight_scale, output_scale,\n feature_bitwidth=feature_bitwidth, weight_bitwidth=weight_bitwidth\n)\n\n양자화 과정이 완료되었습니다! 모델 아키텍처를 인쇄하고 시각화하며 양자화된 모델의 정확성도 검증해 보겠습니다.\n\nQuestion 9.1 (5 pts)\n양자화된 모델을 실행하기 위해서는 (0, 1) 범위의 입력 데이터를 (-128, 127) 범위의 int8 범위로 매핑하는 추가적인 전처리가 필요합니다. 
이 전처리를 진행하는 아래 코드를 완성하세요.\nHint: 양자화된 모델은 fp32 모델과 거의 동일한 정확도를 가지고 있습니다.\n\nprint(quantized_model)\n\ndef extra_preprocess(x):\n # hint: you need to convert the original fp32 input of range (0, 1)\n # into int8 format of range (-128, 127)\n ############### YOUR CODE STARTS HERE ###############\n x_scaled = x * 255\n x_shifted = x_scaled - 128\n return x_shifted.clamp(-128, 127).to(torch.int8)\n ############### YOUR CODE ENDS HERE #################\n\nint8_model_accuracy = evaluate(quantized_model, dataloader['test'],\n extra_preprocess=[extra_preprocess])\nprint(f\"int8 model has accuracy={int8_model_accuracy:.2f}%\")\n\nVGG(\n (backbone): Sequential(\n (0): QuantizedConv2d()\n (1): QuantizedConv2d()\n (2): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (3): QuantizedConv2d()\n (4): QuantizedConv2d()\n (5): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (6): QuantizedConv2d()\n (7): QuantizedConv2d()\n (8): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (9): QuantizedConv2d()\n (10): QuantizedConv2d()\n (11): QuantizedMaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n (12): QuantizedAvgPool2d(kernel_size=2, stride=2, padding=0)\n )\n (classifier): QuantizedLinear()\n)\nint8 model has accuracy=10.00%" }, { "objectID": "posts/labs/lab02.html#question-9.2-bonus-question-5-pts", diff --git a/posts/labs/lab02.ipynb b/posts/labs/lab02.ipynb index de0e865..2cd0519 100644 --- a/posts/labs/lab02.ipynb +++ b/posts/labs/lab02.ipynb @@ -50,7 +50,7 @@ "source": [ "## Goals\n", "\n", - "이 과제에서는 모델 크기와 지연 시간을 줄이기 위해 클래식한 **neural network model**을 **quantizing**하는 연습을 할 것입니다. 이 과제의 목표는 다음과 같습니다:\n", + "이번 실습에서는 모델 크기와 지연 시간을 줄이기 위해 클래식한 **neural network model**을 **quantizing**하는 연습을 할 것입니다. 이 실습의 목표는 다음과 같습니다:\n", "\n", "- **Quantization**의 기본 개념을 이해합니다.\n", "- **k-means quantization**을 구현하고 적용합니다.\n", @@ -72,9 +72,10 @@ "주요 섹션은 ***K-Means Quantization*** 과 ***Linear Quantization*** 2가지로 구성되어 있습니다.\n", "\n", "이번 실습 노트에서 총 ***10***개의 질문을 통해 학습하게 됩니다.:\n", - "- *K-Means Quantization*에 대해서는 ***3***개의 질문이 있습니다 (질문 1-3).\n", - "- *Linear Quantization*에 대해서는 ***6***개의 질문이 있습니다 (질문 4-9).\n", - "- 질문 10은 k-means quantization과 linear quantization을 비교합니다." + "\n", + "- *K-Means Quantization*에 대해서는 ***3***개의 질문이 있습니다 (Question 1-3).\n", + "- *Linear Quantization*에 대해서는 ***6***개의 질문이 있습니다 (Question 4-9).\n", + "- Question 10은 k-means quantization과 linear quantization을 비교합니다." ] }, { @@ -197,6 +198,7 @@ "$n$-bit k-means **quantization**은 시냅스를 $2^n$ 개의 클러스터로 나누고, 동일한 클러스터 내의 시냅스는 동일한 가중치 값을 공유하게 됩니다.\n", "\n", "따라서, k-means **quantization**은 다음과 같은 codebook을 생성합니다:\n", + "\n", "* `centroids`: $2^n$ fp32 클러스터 중심.\n", "* `labels`: 원래 fp32 가중치 텐서와 동일한 #elements를 가진 $n$-bit 정수 텐서. 
각 정수는 해당 클러스터가 어디에 속하는지를 나타냅니다.\n", "\n", @@ -1754,9 +1756,9 @@ "\n", "부동 소수점 convolution은 다음과 같이 작성할 수 있습니다.\n", "\n", - "> $r_{\\mathrm{output}} = \\mathrm{CONV}[r_{\\mathrm{input}}, r_{\\mathrm{weight}}] + r_{\\mathrm{bias}}\\\\\n", - "\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}}), S_{\\mathrm{weight}}q_{\\mathrm{weight}}] + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})\\\\\n", - "\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}) + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})$\n", + "> $r_{\\mathrm{output}} = \\mathrm{CONV}[r_{\\mathrm{input}}, r_{\\mathrm{weight}}] + r_{\\mathrm{bias}}$\\\\\n", + "> $\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[S_{\\mathrm{input}}(q_{\\mathrm{input}}-Z_{\\mathrm{input}}), S_{\\mathrm{weight}}q_{\\mathrm{weight}}] + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})$\\\\\n", + "> $\\;\\;\\;\\;\\;\\;\\;\\;= \\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}]\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}}) + S_{\\mathrm{bias}}(q_{\\mathrm{bias}}-Z_{\\mathrm{bias}})$\n", "\n", "계산을 더 간단하게 하기 위해\n", "\n", @@ -1766,8 +1768,8 @@ "\n", "로 설정하여,\n", "\n", - "> $r_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}})$\n", - "> $\\;\\;\\;\\;\\;\\;\\;\\;= (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}})$\n", + "> $r_{\\mathrm{output}} = (\\mathrm{CONV}[q_{\\mathrm{input}}-Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}} \\cdot S_{\\mathrm{weight}})$ \\\\\n", + "> $\\;\\;\\;\\;\\;\\;\\;\\;\\;= (\\mathrm{CONV}[q_{\\mathrm{input}}, q_{\\mathrm{weight}}] - \\mathrm{CONV}[Z_{\\mathrm{input}}, q_{\\mathrm{weight}}] + q_{\\mathrm{bias}})\\cdot (S_{\\mathrm{input}}S_{\\mathrm{weight}})$\n", "\n", "이며,\n", "\n", @@ -2574,7 +2576,9 @@ " # hint: you need to convert the original fp32 input of range (0, 1)\n", " # into int8 format of range (-128, 127)\n", " ############### YOUR CODE STARTS HERE ###############\n", - " return x.clamp(-128, 127).to(torch.int8)\n", + " x_scaled = x * 255\n", + " x_shifted = x_scaled - 128\n", + " return x_shifted.clamp(-128, 127).to(torch.int8)\n", " ############### YOUR CODE ENDS HERE #################\n", "\n", "int8_model_accuracy = evaluate(quantized_model, dataloader['test'],\n", @@ -2655,24 +2659,6 @@ "\n", "응용 프로그램의 특정 요구 사항에 따라 K-means 기반 양자화와 선형 양자화 사이에서 선택해야 하며, 정확성, 처리 지연 시간 및 사용 가능한 계산 리소스의 중요성을 고려해야 합니다." ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xxOBqoXoSUfE" - }, - "source": [ - "# Feedback" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ajqZJes3SVc-" - }, - "source": [ - "Please fill out this [feedback form](https://forms.gle/ZeCH5anNPrkd5wpp7) when you finished this lab. We would love to hear your thoughts or feedback on how we can improve this lab!" - ] } ], "metadata": {