diff --git a/samples/python/README.md b/samples/python/README.md
new file mode 100644
index 0000000..9e137ad
--- /dev/null
+++ b/samples/python/README.md
@@ -0,0 +1,574 @@
+# README
+
+## Introduction
+This guide helps developers use QAI AppBuilder with the QNN SDK to execute models on Windows on Snapdragon (WoS) platforms.
+
+## Setting Up QAI AppBuilder Environment and Preparing QNN SDK Libraries
+
+Set up the QAI AppBuilder environment and prepare the QNN SDK libraries by referring to the links below:
+
+https://github.com/quic/ai-engine-direct-helper/blob/main/README.md
+https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md
+
+Create a new folder named after the model you intend to deploy:
+```
+C:\ai-hub\model_name\
+```
+
+Create a subfolder called `qnn` and copy the QNN libraries from the QNN SDK to the following paths:
+```
+C:\ai-hub\model_name\qnn\libqnnhtpv73.cat
+C:\ai-hub\model_name\qnn\libQnnHtpV73Skel.so
+C:\ai-hub\model_name\qnn\QnnCpu.dll
+C:\ai-hub\model_name\qnn\QnnHtp.dll
+C:\ai-hub\model_name\qnn\QnnHtpPrepare.dll
+C:\ai-hub\model_name\qnn\QnnHtpV73Stub.dll
+C:\ai-hub\model_name\qnn\QnnSystem.dll
+```
+
+## Prepare the QNN model
+Download the QNN version of the model you wish to deploy from the Qualcomm® AI Hub:
+https://aihub.qualcomm.com/compute/models/
+
+Choose the `TorchScript -> Qualcomm® AI Engine Direct` option, then use `Download model` to download the QNN version of the model for deployment on Windows on Snapdragon (WoS) platforms.
+
+If these options are unavailable, use the following command to export the QNN version of the model:
+```
+python -m qai_hub_models.models.model_name.export --device "Snapdragon X Elite CRD" --target-runtime qnn
+```
+
+Part of the command output is shown below. You can download the compiled model from the job link printed in the output, e.g. [https://app.aihub.qualcomm.com/jobs/j1p86jxog/](https://app.aihub.qualcomm.com/jobs/j1p86jxog/).
+
+E.g. output for `real_esrgan_general_x4v3`:
+
+```
+Optimizing model real_esrgan_general_x4v3 to run on-device
+Uploading model: 100%|████████████████████████████████████████████████████████████████████████████████| 64.7M/64.7M [00:21<00:00, 3.15MB/s]
+Scheduled compile job (j1p86jxog) successfully. To see the status and results:
+    https://app.aihub.qualcomm.com/jobs/j1p86jxog/
+```
+
+Note: The job link is different for every model and compile job, so use the one printed for your run.
+
+After downloading the model, create a new folder called `models` and copy the model file to the following path:
+```
+C:\ai-hub\model_name\models\model_name.bin
+```
+## Quantize the QNN model
+
+Use the `--quantize_full_type` option of `submit_compile_job` to quantize an unquantized model to the specified type. It quantizes both activations and weights using a representative dataset. If no such dataset is provided, a randomly generated one is used; in that case, the generated model can only serve as a proxy for achievable performance, as it will not produce accurate results.
+
+Options:
+
+- `int8`: quantize activations and weights to int8
+- `int16`: quantize activations and weights to int16
+- `w8a16`: quantize weights to int8 and activations to int16 (recommended over int16)
+- `w4a8`: quantize weights to int4 and activations to int8
+- `w4a16`: quantize weights to int4 and activations to int16
+
+Requirements:
+
+- This option cannot be used if the target runtime is ONNX.
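+
+As a reference, below is a minimal sketch of passing this option directly to `submit_compile_job` via the `qai_hub` Python client. It is illustrative only: the local file name `model.pt`, the input spec and the option string are placeholders to be adjusted for your own model.
+
+```
+import qai_hub as hub
+
+# Submit a compile job that also quantizes the model (sketch with placeholder inputs).
+compile_job = hub.submit_compile_job(
+    model="model.pt",                                   # placeholder: your TorchScript model
+    device=hub.Device("Snapdragon X Elite CRD"),
+    input_specs=dict(image=(1, 3, 512, 512)),           # placeholder: adjust to the model's input
+    options="--target_runtime qnn_context_binary --quantize_full_type w8a16",
+)
+
+# Download the compiled (and quantized) context binary once the job finishes.
+target_model = compile_job.get_target_model()
+target_model.download("model_quantized.bin")
+```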
+ +Examples: + +Build a new python file called `export.py` and copy the code of `qai_hub_models.models.model_name.export` to this file. Add one code `compile_options = " --quantize_full_type int8"` to the `export.py` of the model to be quantized and use the following command to export the quantized model: + +``` +python .\export.py --target-runtime qnn --device "Snapdragon X Elite CRD" +``` + +The part of change of the `export.py` is as follows: + +``` +# 2. Compile the model to an on-device asset +compile_options = " --quantize_full_type int8" #add this code to quantize the model + +model_compile_options = model.get_hub_compile_options( + target_runtime, compile_options + channel_last_flags, hub_device +) +print(f"Optimizing model {model_name} to run on-device") +submitted_compile_job = hub.submit_compile_job( + model=source_model, + input_specs=input_spec, + device=hub_device, + name=model_name, + options=model_compile_options, +) +compile_job = cast(hub.client.CompileJob, submitted_compile_job) +``` + +## Sample Code for Deploying the Model and Executing Inference + +The sample code refers to the following path: +https://github.com/quic/ai-engine-direct-helper/blob/main/docs/user_guide.md + +Here we take `lama_dilated` as example: + +Sample Code (Python): +``` +import os +import numpy as np +import torch +import torchvision.transforms as transforms + +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from typing import Callable, Dict, List, Tuple + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +image_size = 512 +lamadilated = None +image_buffer = None + + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def preprocess_inputs( + pixel_values_or_image: Image, + mask_pixel_values_or_image: Image, +) -> Dict[str, torch.Tensor]: + + NCHW_fp32_torch_frames = preprocess_PIL_image(pixel_values_or_image) + NCHW_fp32_torch_masks = preprocess_PIL_image(mask_pixel_values_or_image) + + # The number of input images should equal the number of input masks. + if NCHW_fp32_torch_masks.shape[0] != 1: + NCHW_fp32_torch_masks = NCHW_fp32_torch_masks.tile( + (NCHW_fp32_torch_frames.shape[0], 1, 1, 1) + ) + + # Mask input image + image_masked = ( + NCHW_fp32_torch_frames * (1 - NCHW_fp32_torch_masks) + NCHW_fp32_torch_masks + ) + + return {"image": image_masked, "mask": NCHW_fp32_torch_masks} + +# LamaDilated class which inherited from the class QNNContext. +class LamaDilated(QNNContext): + def Inference(self, input_data, input_mask): + input_datas=[input_data, input_mask] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global lamadilated + + # Config AppBuilder environment. 
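+    # Arguments: the folder holding the QNN libraries prepared earlier, the runtime to use
+    # (HTP here), the log verbosity and the profiling level.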
+ QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for LamaDilated objects. + lamadilated_model = "models\\lama_dilated.bin" + lamadilated = LamaDilated("lamadilated", lamadilated_model) + +def Inference(input_image_path, input_mask_path, output_image_path): + global image_buffer + + # Read and preprocess the image&mask. + image = Image.open(input_image_path) + mask = Image.open(input_mask_path) + inputs = preprocess_inputs(image, mask) + image_masked, mask_torch = inputs["image"], inputs["mask"] + image_masked = image_masked.numpy() + mask_torch = mask_torch.numpy() + + image_masked = np.transpose(image_masked, (0, 2, 3, 1)) + mask_torch = np.transpose(mask_torch, (0, 2, 3, 1)) + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_image = lamadilated.Inference([image_masked], [mask_torch]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + # show&save the result + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(image_size, image_size, 3) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + image_buffer = output_image[0] + image_buffer.show() + image_buffer.save(output_image_path) + +def Release(): + global lamadilated + + # Release the resources. + del(lamadilated) + + +Init() + +Inference("test_input_image.png", "test_input_mask.png", "out.png") + +Release() +``` + +1. Use the `See more metrics` option from the following link: [https://aihub.qualcomm.com/compute/models/lama_dilated](https://aihub.qualcomm.com/compute/models/lama_dilated) to check the input data format. Here, we can see that: + + `Input Specs` + + `image`: float32[1, 512, 512, 3] + + `mask`: float32[1, 512, 512, 1] + + The input data format for `lama_dilated` is `[N, H, W, C]`, which stands for Number, Height, Width, and Channel. + +2. Refer to the following link to learn how to preprocess input data and post-process output data: + + https://github.com/quic/ai-hub-models/blob/main/qai_hub_models/models/lama_dilated/demo.py + + Refer to the following code to check the inference demo for `lama_dilated` in AI Hub: + + ``` + def main(is_test: bool = False): + repaint_demo(LamaDilated, MODEL_ID, IMAGE_ADDRESS, MASK_ADDRESS, is_test) + ``` + + ``` + # Run repaint app end-to-end on a sample image. + # The demo will display the predicted image in a window. 
+ def repaint_demo( + model_type: Type[BaseModel], + model_id: str, + default_image: str | CachedWebAsset, + default_mask: str | CachedWebAsset, + is_test: bool = False, + available_target_runtimes: List[TargetRuntime] = list( + TargetRuntime.__members__.values() + ), + ): + # Demo parameters + parser = get_model_cli_parser(model_type) + parser = get_on_device_demo_parser( + parser, available_target_runtimes=available_target_runtimes, add_output_dir=True + ) + parser.add_argument( + "--image", + type=str, + default=default_image, + help="test image file path or URL", + ) + parser.add_argument( + "--mask", + type=str, + default=default_mask, + help="test mask file path or URL", + ) + args = parser.parse_args([] if is_test else None) + validate_on_device_demo_args(args, model_id) + + # Load image & model + model = demo_model_from_cli_args(model_type, model_id, args) + image = load_image(args.image) + mask = load_image(args.mask) + print("Model Loaded") + + # Run app + app = RepaintMaskApp(model) + out = app.paint_mask_on_image(image, mask)[0] + + if not is_test: + display_or_save_image(image, args.output_dir, "input_image.png", "input image") + display_or_save_image(out, args.output_dir, "output_image.png", "output image") + ``` + + The key function for inference in AI Hub is `app.paint_mask_on_image(image, mask)[0]`: + + ``` + def paint_mask_on_image( + self, + pixel_values_or_image: torch.Tensor | np.ndarray | Image | List[Image], + mask_pixel_values_or_image: torch.Tensor | np.ndarray | Image, + ) -> List[Image]: + """ + Erases and repaints the source image[s] in the pixel values given by the mask. + + Parameters: + pixel_values_or_image + PIL image(s) + or + numpy array (N H W C x uint8) or (H W C x uint8) -- both RGB channel layout + or + pyTorch tensor (N C H W x fp32, value range is [0, 1]), RGB channel layout + + mask_pixel_values_or_image + PIL image(s) + or + numpy array (N H W C x uint8) or (H W C x uint8) -- both RGB channel layout + or + pyTorch tensor (N C H W x fp32, value range is [0, 1]), RGB channel layout + + If one mask is provided, it will be used for every input image. + + Returns: + images: List[PIL.Image] + A list of predicted images (one list element per batch). + """ + inputs = self.preprocess_inputs( + pixel_values_or_image, mask_pixel_values_or_image + ) + + out = self.model(inputs["image"], inputs["mask"]) + + return [torch_tensor_to_PIL_image(img) for img in out] + ``` + +3. Use the code `inputs = self.preprocess_inputs(pixel_values_or_image, mask_pixel_values_or_image)` to preprocess input data (PIL Images). The corresponding sample code is as follows: + + ``` + # Read and preprocess the image&mask. 
+ image = Image.open(input_image_path) + mask = Image.open(input_mask_path) + inputs = preprocess_inputs(image, mask) + image_masked, mask_torch = inputs["image"], inputs["mask"] + image_masked = image_masked.numpy() + mask_torch = mask_torch.numpy() + + image_masked = np.transpose(image_masked, (0, 2, 3, 1)) + mask_torch = np.transpose(mask_torch, (0, 2, 3, 1)) + ``` + + The function `preprocess_inputs(image, mask)` remains the same as in AI Hub: + + ``` + def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + ``` + + ``` + def preprocess_inputs( + pixel_values_or_image: Image, + mask_pixel_values_or_image: Image, + ) -> Dict[str, torch.Tensor]: + + NCHW_fp32_torch_frames = preprocess_PIL_image(pixel_values_or_image) + NCHW_fp32_torch_masks = preprocess_PIL_image(mask_pixel_values_or_image) + + # The number of input images should equal the number of input masks. + if NCHW_fp32_torch_masks.shape[0] != 1: + NCHW_fp32_torch_masks = NCHW_fp32_torch_masks.tile( + (NCHW_fp32_torch_frames.shape[0], 1, 1, 1) + ) + + # Mask input image + image_masked = ( + NCHW_fp32_torch_frames * (1 - NCHW_fp32_torch_masks) + NCHW_fp32_torch_masks + ) + + return {"image": image_masked, "mask": NCHW_fp32_torch_masks} + ``` + + Since the input data type for the QNN version model is a `numpy array` with the format `[N, H, W, C]`, we need to convert the output of the `preprocess_inputs` function from `torch` to `numpy` and adjust its format to `[N, H, W, C]`: + + ``` + image_masked = image_masked.numpy() + mask_torch = mask_torch.numpy() + + image_masked = np.transpose(image_masked, (0, 2, 3, 1)) + mask_torch = np.transpose(mask_torch, (0, 2, 3, 1)) + ``` + +4. The output data type for the QNN version model is also a `numpy array`. Therefore, we need to convert it to `torch` and reshape it to the format `[N, H, W, C]`, as the format of the output data corresponds to that of the input data: + + ``` + def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + ``` + + ``` + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(image_size, image_size, 3) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + ``` + + The above corresponds to the code below in AI Hub: + + ``` + out = app.paint_mask_on_image(image, mask)[0] + ``` + + ``` + return [torch_tensor_to_PIL_image(img) for img in out] + ``` + + The final format of the output data is `PIL`. + + +Note: +1. The input and output data patterns are `numpy arrays`. You may need to convert the output to `torch` for post-processing. +2. Use the `See more metrics` option from the following link: [https://aihub.qualcomm.com/compute/models/lama_dilated](https://aihub.qualcomm.com/compute/models/lama_dilated) to check the input data pattern. 
The input data pattern might be `[N, H, W, C]` or `[N, C, W, H]`, and the inference results will vary significantly if the input data pattern is incorrect. Add the following code to the sample code for pattern conversion: +``` +input_data = np.transpose(input_data, (0, 2, 3, 1)) +``` +3. The output of the inference is `numpy array`, you need to transfer it to torch and reshape it correctly. The shape of the output data corresponds to that of input data. + +Copy the input data, e.g., a sample 512x512 image and mask, to the following path: +``` +C:\ai-hub\lama_dilated\test_input_image.png +C:\ai-hub\lama_dilated\test_input_mask.png +``` + +Run the sample code: +``` +python lama_dilated.py +``` + +Special Note: + +1. When run the inference of some model e.g. `fastSam_x` or `yolov8_det` which may use ultralytics library, please do not install the latest version in case of some error when running inference. Use the following command to install the designated version of ultralytics library: + + ``` + pip install ultralytics==8.0.193 + ``` + +2. When run the inference of some model e.g. `fastSam_x` or `yolov8_det` which may use the function `torchvision.ops.nms` to postprocess the result of inference, you may meet some error like: + ``` + Traceback (most recent call last): + File "C:\taonan\fastsam_x\fastsam_x.py", line 135, in + Inference("in.jpg", "out.jpg") + File "C:\taonan\fastsam_x\fastsam_x.py", line 94, in Inference + p = ops.non_max_suppression( + ^^^^^^^^^^^^^^^^^^^^^^^^ + File "C:\Programs\Python\Python311-arm64\Lib\site-packages\ultralytics\utils\ops.py", line 291, in non_max_suppression + i = torchvision.ops.nms(boxes, scores, iou_thres) # NMS + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "C:\Programs\Python\Python311-arm64\Lib\site-packages\torchvision\ops\boxes.py", line 41, in nms + return torch.ops.torchvision.nms(boxes, scores, iou_threshold) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "C:\Programs\Python\Python311-arm64\Lib\site-packages\torch\_ops.py", line 854, in __call__ + return self_._op(*args, **(kwargs or {})) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + NotImplementedError: Could not run 'torchvision::nms' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::nms' is only available for these backends: [CUDA, Meta, QuantizedCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, AutocastPrivateUse1, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher]. + + ``` + + It may be caused by the difference of environment between WoS and x86, so please replace the function `torchvision.ops.nms` with another function made by ourselves, you can name it as anything you like, e.g. 
`custom_nms`: + ``` + def custom_nms(boxes, scores, iou_threshold): + ''' + self definition of nms function cause nms from torch is not avaliable on this device without cuda + ''' + + if len(boxes) == 0: + return torch.empty((0,), dtype=torch.int64) + + # transfer to numpy array + boxes_np = boxes.cpu().numpy() + scores_np = scores.cpu().numpy() + + # get the coor of boxes + x1 = boxes_np[:, 0] + y1 = boxes_np[:, 1] + x2 = boxes_np[:, 2] + y2 = boxes_np[:, 3] + + # compute the area of each single boxes + areas = (x2 - x1 + 1) * (y2 - y1 + 1) + order = scores_np.argsort()[::-1] + + keep = [] + while order.size > 0: + i = order[0] + keep.append(i) + xx1 = np.maximum(x1[i], x1[order[1:]]) + yy1 = np.maximum(y1[i], y1[order[1:]]) + xx2 = np.minimum(x2[i], x2[order[1:]]) + yy2 = np.minimum(y2[i], y2[order[1:]]) + + w = np.maximum(0.0, xx2 - xx1 + 1) + h = np.maximum(0.0, yy2 - yy1 + 1) + inter = w * h + ovr = inter / (areas[i] + areas[order[1:]] - inter) + + inds = np.where(ovr <= iou_threshold)[0] + order = order[inds + 1] + + return torch.tensor(keep, dtype=torch.int64) + ``` + + 3. When run the inference of model `fastsam_x`, specifically the `text_prompt` function of this model shown by the following code: + + ``` + prompt_process = FastSAMPrompt(image_path[0], results, device="cpu") + segmented_result = prompt_process.text_prompt(text='the yellow dog') + ``` + + please change the function `def text_prompt(self, text):` of the following python file: + + ```` + C:\Programs\Python\Python311-arm64\Lib\site-packages\ultralytics\models\fastsam\prompt.py + ```` + + specifically change the code `self.results[0].masks.data = torch.tensor(np.array([ann['segmentation'] for ann in annotations]))` to + + `self.results[0].masks.data = torch.tensor(np.array([annotations[max_idx]['segmentation']]))`. + + The detail of the changed function `def text_prompt(self, text):` is as follows: + + ``` + def text_prompt(self, text): + if self.results[0].masks is not None: + format_results = self._format_results(self.results[0], 0) + cropped_boxes, cropped_images, not_crop, filter_id, annotations = self._crop_image(format_results) + clip_model, preprocess = self.clip.load('ViT-B/32', device=self.device) + scores = self.retrieve(clip_model, preprocess, cropped_boxes, text, device=self.device) + max_idx = scores.argsort() + max_idx = max_idx[-1] + max_idx += sum(np.array(filter_id) <= int(max_idx)) + self.results[0].masks.data = torch.tensor(np.array([annotations[max_idx]['segmentation']])) + return self.results + ``` + +## Output + +The output, e.g. output image will be saved to the following path: +``` +C:\ai-hub\model_name\out.png +``` + +Use the following command to compare the results on Windows on Snapdragon (WoS) platforms and the Host Cloud Device: +``` +python -m qai_hub_models.models.model_name.demo --on-device --hub-model-id m1m6gor6q --device "Snapdragon X Elite CRD" +``` + + +## Reference +You need to setup the AppBuilder environment before you run the sample code. 
Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + diff --git a/samples/python/aotgan/README.md b/samples/python/aotgan/README.md new file mode 100644 index 0000000..1b5cf56 --- /dev/null +++ b/samples/python/aotgan/README.md @@ -0,0 +1,59 @@ +# aotgan Sample Code + +## Introduction +This is sample code for using AppBuilder to load aotgan QNN model to HTP and execute inference to erase and in-paint part of given input image. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\aotgan\qnn\libQnnHtpV73Skel.so +C:\ai-hub\aotgan\qnn\QnnHtp.dll +C:\ai-hub\aotgan\qnn\QnnHtpV73Stub.dll +C:\ai-hub\aotgan\qnn\QnnSystem.dll +C:\ai-hub\aotgan\qnn\libqnnhtpv73.cat +``` + +## aotgan QNN models +Download the quantized aotgan QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/aotgan + +After downloaded the model, copy it to the following path: +``` +"C:\ai-hub\aotgan\models\aotgan.bin" +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/aotgan/aotgan.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\aotgan\ +``` + +Copy one sample 512x512 image and mask to following path: +``` +C:\ai-hub\aotgan\test_input_image.png +C:\ai-hub\aotgan\test_input_mask.png +``` + +Run the sample code: +``` +python aotgan.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\aotgan\out.png +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/aotgan/aotgan.py b/samples/python/aotgan/aotgan.py new file mode 100644 index 0000000..8b1786a --- /dev/null +++ b/samples/python/aotgan/aotgan.py @@ -0,0 +1,123 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. 
+# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import numpy as np +import torch +import torchvision.transforms as transforms + +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from typing import Callable, Dict, List, Tuple + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +image_size = 512 +aotgan = None +image_buffer = None + + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def preprocess_inputs( + pixel_values_or_image: Image, + mask_pixel_values_or_image: Image, +) -> Dict[str, torch.Tensor]: + + NCHW_fp32_torch_frames = preprocess_PIL_image(pixel_values_or_image) + NCHW_fp32_torch_masks = preprocess_PIL_image(mask_pixel_values_or_image) + + # The number of input images should equal the number of input masks. + if NCHW_fp32_torch_masks.shape[0] != 1: + NCHW_fp32_torch_masks = NCHW_fp32_torch_masks.tile( + (NCHW_fp32_torch_frames.shape[0], 1, 1, 1) + ) + + # Mask input image + image_masked = ( + NCHW_fp32_torch_frames * (1 - NCHW_fp32_torch_masks) + NCHW_fp32_torch_masks + ) + + return {"image": image_masked, "mask": NCHW_fp32_torch_masks} + +# AotGan class which inherited from the class QNNContext. +class AotGan(QNNContext): + def Inference(self, input_data, input_mask): + input_datas=[input_data, input_mask] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global aotgan + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for AotGan objects. + aotgan_model = "models\\aotgan.bin" + aotgan = AotGan("aotgan", aotgan_model) + +def Inference(input_image_path, input_mask_path, output_image_path): + global image_buffer + + # Read and preprocess the image&mask. + image = Image.open(input_image_path) + mask = Image.open(input_mask_path) + inputs = preprocess_inputs(image, mask) + image_masked, mask_torch = inputs["image"], inputs["mask"] + image_masked = image_masked.numpy() + mask_torch = mask_torch.numpy() + + image_masked = np.transpose(image_masked, (0, 2, 3, 1)) + mask_torch = np.transpose(mask_torch, (0, 2, 3, 1)) + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_image = aotgan.Inference([image_masked], [mask_torch]) + + # Reset the HTP. 
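+    # Releasing the global profile lets the HTP drop back from burst to its default performance level.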
+ PerfProfile.RelPerfProfileGlobal() + + # show%save the result + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(image_size, image_size, 3) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + image_buffer = output_image[0] + image_buffer.save(output_image_path) + image_buffer.show() + + +def Release(): + global aotgan + + # Release the resources. + del(aotgan) + + +Init() + +Inference("input.png", "mask.png", "output.png") + +Release() diff --git a/samples/python/aotgan/input.png b/samples/python/aotgan/input.png new file mode 100644 index 0000000..e05e8b1 Binary files /dev/null and b/samples/python/aotgan/input.png differ diff --git a/samples/python/aotgan/mask.png b/samples/python/aotgan/mask.png new file mode 100644 index 0000000..51864ec Binary files /dev/null and b/samples/python/aotgan/mask.png differ diff --git a/samples/python/fastsam_x/README.md b/samples/python/fastsam_x/README.md new file mode 100644 index 0000000..39f6ad9 --- /dev/null +++ b/samples/python/fastsam_x/README.md @@ -0,0 +1,58 @@ +# unet_segmentation Sample Code + +## Introduction +This is sample code for using AppBuilder to load unet_segmentation QNN model to HTP and execute inference to produce a segmentation mask for an image. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\unet_segmentation\qnn\libQnnHtpV73Skel.so +C:\ai-hub\unet_segmentation\qnn\QnnHtp.dll +C:\ai-hub\unet_segmentation\qnn\QnnHtpV73Stub.dll +C:\ai-hub\unet_segmentation\qnn\QnnSystem.dll +C:\ai-hub\unet_segmentation\qnn\libqnnhtpv73.cat +``` + +## unet_segmentation QNN models +Download the quantized unet_segmentation QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/unet_segmentation + +After downloaded the model, copy it to the following path: +``` +"C:\ai-hub\unet_segmentation\models\unet_segmentation.bin" +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/unet_segmentation/unet_segmentation.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\unet_segmentation\ +``` + +Copy one sample image to following path: +``` +C:\ai-hub\unet_segmentation\in.jpg +``` + +Run the sample code: +``` +python unet_segmentation.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\unet_segmentation\out.jpg +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/fastsam_x/fastsam_x.py b/samples/python/fastsam_x/fastsam_x.py new file mode 100644 index 0000000..cfa6e47 --- /dev/null +++ b/samples/python/fastsam_x/fastsam_x.py @@ -0,0 +1,283 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. 
+# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +from __future__ import annotations + +import os +import numpy as np +import math +import torch +import torchvision.transforms as transforms + +from typing import Callable, Dict, List, Tuple +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from ultralytics.engine.results import Results +from ultralytics.models.fastsam import FastSAMPrompt +from ultralytics.models.fastsam.utils import bbox_iou +from ultralytics.utils import ops + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +fastsam = None +confidence: float = 0.4, +iou_threshold: float = 0.9, +retina_masks: bool = True, +model_image_input_shape: Tuple[int, int] = (640, 640) + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.PILToTensor()]) # bgr image + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.permute(1, 2, 0).detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def resize_pad(image: torch.Tensor, dst_size: Tuple[int, int]): + """ + Resize and pad image to be shape [..., dst_size[0], dst_size[1]] + + Parameters: + image: (..., H, W) + Image to reshape. + + dst_size: (height, width) + Size to which the image should be reshaped. + + Returns: + rescaled_padded_image: torch.Tensor (..., dst_size[0], dst_size[1]) + scale: scale factor between original image and dst_size image, (w, h) + pad: pixels of padding added to the rescaled image: (left_padding, top_padding) + + Based on https://github.com/zmurez/MediaPipePyTorch/blob/master/blazebase.py + """ + height, width = image.shape[-2:] + dst_frame_height, dst_frame_width = dst_size + + h_ratio = dst_frame_height / height + w_ratio = dst_frame_width / width + scale = min(h_ratio, w_ratio) + if h_ratio < w_ratio: + scale = h_ratio + new_height = dst_frame_height + new_width = math.floor(width * scale) + else: + scale = w_ratio + new_height = math.floor(height * scale) + new_width = dst_frame_width + + new_height = math.floor(height * scale) + new_width = math.floor(width * scale) + pad_h = dst_frame_height - new_height + pad_w = dst_frame_width - new_width + + pad_top = int(pad_h // 2) + pad_bottom = int(pad_h // 2 + pad_h % 2) + pad_left = int(pad_w // 2) + pad_right = int(pad_w // 2 + pad_w % 2) + + rescaled_image = interpolate( + image, size=[int(new_height), int(new_width)], mode="bilinear" + ) + rescaled_padded_image = pad( + rescaled_image, (pad_left, pad_right, pad_top, pad_bottom) + ) + padding = (pad_left, pad_top) + + return rescaled_padded_image, scale, padding + +def undo_resize_pad( + image: torch.Tensor, + orig_size_wh: Tuple[int, int], + scale: float, + padding: Tuple[int, int], +): + """ + Undos the efffect of resize_pad. Instead of scale, the original size + (in order width, height) is provided to prevent an off-by-one size. 
+ """ + width, height = orig_size_wh + + rescaled_image = interpolate(image, scale_factor=1 / scale, mode="bilinear") + + scaled_padding = [int(round(padding[0] / scale)), int(round(padding[1] / scale))] + + cropped_image = rescaled_image[ + ..., + scaled_padding[1] : scaled_padding[1] + height, + scaled_padding[0] : scaled_padding[0] + width, + ] + + return cropped_image + +def pil_resize_pad( + image: Image, dst_size: Tuple[int, int] +) -> Tuple[Image, float, Tuple[int, int]]: + torch_image = preprocess_PIL_image(image) + torch_out_image, scale, padding = resize_pad( + torch_image, + dst_size, + ) + pil_out_image = torch_tensor_to_PIL_image(torch_out_image[0]) + return (pil_out_image, scale, padding) + +def pil_undo_resize_pad( + image: Image, orig_size_wh: Tuple[int, int], scale: float, padding: Tuple[int, int] +) -> Image: + torch_image = preprocess_PIL_image(image) + torch_out_image = undo_resize_pad(torch_image, orig_size_wh, scale, padding) + pil_out_image = torch_tensor_to_PIL_image(torch_out_image[0]) + return pil_out_image + +# FastSam_x class which inherited from the class QNNContext. +class FastSam(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas) + return output_data + +def Init(): + global fastsam + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for FastSam_x objects. + fastsam_model = "models\\fastsam_x.bin" + fastsam = FastSam("fastsam", fastsam_model) + +def Inference(input_image_path, output_image_path): + global confidence, iou_threshold, retina_masks, model_image_input_shape + + # Read and preprocess the image. + original_image = Image.open(input_image_path) + resized_image, scale, padding = pil_resize_pad(original_image, (model_image_input_shape[0], model_image_input_shape[1])) + + Img = preprocess_PIL_image(resized_image) + img = preprocess_PIL_image(resized_image).numpy() + img = np.transpose(img, (0, 2, 3, 1)) + original_image = np.array(original_image) + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + image_path = ["out"] + preds = fastsam.Inference([img]) + + # Reset the HTP. 
+ PerfProfile.RelPerfProfileGlobal() + + # Post Processing + preds = [ + torch.tensor(preds[0]).reshape(1, 32, 160, 160), + torch.tensor(preds[1]).reshape(1, 32, 8400), + torch.tensor(preds[2]).reshape(1, 105, 80, 80), + torch.tensor(preds[3]).reshape(1, 105, 40, 40), + torch.tensor(preds[4]).reshape(1, 105, 20, 20), + torch.tensor(preds[5]).reshape(1, 37, 8400) + ] + + preds = tuple( + (preds[5], tuple(([preds[2], preds[3], preds[4]], preds[1], preds[0]))) + ) + + p = ops.non_max_suppression( + preds[0], + 0.4, + 0.9, + agnostic=False, + max_det=100, + nc=1, # set to 1 class since SAM has no class predictions + classes=None, + ) + + full_box = torch.zeros(p[0].shape[1], device=p[0].device) + full_box[2], full_box[3], full_box[4], full_box[6:] = ( + Img.shape[3], + Img.shape[2], + 1.0, + 1.0, + ) + full_box = full_box.view(1, -1) + critical_iou_index = bbox_iou( + full_box[0][:4], p[0][:, :4], iou_thres=0.9, image_shape=Img.shape[2:] + ) + if critical_iou_index.numel() != 0: + full_box[0][4] = p[0][critical_iou_index][:, 4] + full_box[0][6:] = p[0][critical_iou_index][:, 6:] + p[0][critical_iou_index] = full_box + + results = [] + proto = ( + preds[1][-1] if len(preds[1]) == 3 else preds[1] + ) # second output is len 3 if pt, but only 1 if exported + for i, pred in enumerate(p): + orig_img = original_image + img_path = image_path[0][i] + # No predictions, no masks + if not len(pred): + masks = None + elif retina_masks: + pred[:, :4] = ops.scale_boxes( + Img.shape[2:], pred[:, :4], orig_img.shape + ) + + masks = ops.process_mask_native( + proto[i], pred[:, 6:], pred[:, :4], orig_img.shape[:2] + ) # HWC + else: + masks = ops.process_mask( + proto[i], pred[:, 6:], pred[:, :4], Img.shape[2:], upsample=True + ) # HWC + pred[:, :4] = ops.scale_boxes( + Img.shape[2:], pred[:, :4], orig_img.shape + ) + results.append( + Results( + orig_img, + path=img_path, + names="fastsam", + boxes=pred[:, :6], + masks=masks, + ) + ) + + # Get segmented_result + prompt_process = FastSAMPrompt(image_path[0], results, device="cpu") + #segmented_result = prompt_process.everything_prompt() + #segmented_result = prompt_process.text_prompt(text='the yellow dog') + segmented_result = prompt_process.point_prompt(points=[[290, 433]], pointlabel=[1]) + #segmented_result = prompt_process.box_prompt([320, 80, 40, 40]) + prompt_process.plot(annotations=segmented_result, output="output") + + # Get Binary Mask Result + binary_mask = segmented_result[0].masks.data.squeeze().cpu().numpy().astype(np.uint8) + binary_mask = binary_mask * 255 + mask_image = Image.fromarray(binary_mask) + mask_image.show() + mask_image.save(output_image_path) + +def Release(): + global fastsam + + # Release the resources. + del(fastsam) + + +Init() + +Inference("input.jpg", "output.jpg") + +Release() diff --git a/samples/python/fastsam_x/input.jpg b/samples/python/fastsam_x/input.jpg new file mode 100644 index 0000000..71418ae Binary files /dev/null and b/samples/python/fastsam_x/input.jpg differ diff --git a/samples/python/inception_v3/README.md b/samples/python/inception_v3/README.md new file mode 100644 index 0000000..1b0873d --- /dev/null +++ b/samples/python/inception_v3/README.md @@ -0,0 +1,65 @@ +# inception_v3 Sample Code + +## Introduction +This is sample code for using AppBuilder to load inception_v3 QNN model to HTP and execute inference to classify images from the Imagenet dataset. +It can also be used as a backbone in building more complex models for specific use cases. 
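+
+Note: the sample script reads class names from an `imagenet_classes.txt` file located next to it (see `post_process` in the code). If you do not already have this file, one common source is the label list used in the PyTorch examples; a small helper sketch (the URL is the standard PyTorch hub labels file):
+
+```
+import urllib.request
+
+# Fetch the ImageNet class-name list expected by the sample's post-processing step.
+url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
+urllib.request.urlretrieve(url, "imagenet_classes.txt")
+```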
+ +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\inception_v3\qnn\libQnnHtpV73Skel.so +C:\ai-hub\inception_v3\qnn\QnnHtp.dll +C:\ai-hub\inception_v3\qnn\QnnHtpV73Stub.dll +C:\ai-hub\inception_v3\qnn\QnnSystem.dll +C:\ai-hub\inception_v3\qnn\libqnnhtpv73.cat +``` + +## inception_v3 QNN models +Download the quantized inception_v3 QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/inception_v3 + +After downloaded the model, copy it to the following path: +``` +"C:\ai-hub\inception_v3\models\inception_v3.bin" +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/inception_v3/inception_v3.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\inception_v3\ +``` + +Copy one sample image to following path: +``` +C:\ai-hub\inception_v3\in.jpg +``` + +Run the sample code: +``` +python inception_v3.py +``` + +## Output +The output will be shown in the terminal: +``` +Top 5 predictions for image: + +Samoyed 0.999897 +Alaskan tundra wolf 0.00828 +Arctic fox 0.00118 +husky 0.00026 +Pyrenean Mountain Dog 0.000204 +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/inception_v3/inception_v3.py b/samples/python/inception_v3/inception_v3.py new file mode 100644 index 0000000..ef93dfa --- /dev/null +++ b/samples/python/inception_v3/inception_v3.py @@ -0,0 +1,92 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import cv2 +import numpy as np + +import torch +import torchvision.transforms as transforms +from PIL import Image +from PIL.Image import fromarray as ImageFromArray + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +#################################################################### + +inceptionV3 = None + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + preprocess = transforms.Compose([ + transforms.Resize(256), + transforms.CenterCrop(224), + transforms.ToTensor(), + ]) + + img = preprocess(image) + img = img.unsqueeze(0) + return img + +def post_process(probabilities, output): + # Read the categories + with open("imagenet_classes.txt", "r") as f: + categories = [s.strip() for s in f.readlines()] + # Show top categories per image + top5_prob, top5_catid = torch.topk(probabilities, 5) + print("Top 5 predictions for image:\n") + for i in range(top5_prob.size(0)): + print(categories[top5_catid[i]], top5_prob[i].item()) + +# InceptionV3 class which inherited from the class QNNContext. 
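+# QNNContext (from qai_appbuilder) loads the compiled .bin context binary; its Inference() takes a
+# list of numpy inputs and returns a list of numpy outputs.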
+class InceptionV3(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global inceptionV3 + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for InceptionV3 objects. + inceptionV3_model = "models\\inception_v3.bin" + inceptionV3 = InceptionV3("inceptionV3", inceptionV3_model) + +def Inference(input_image_path, output_image_path): + # Read and preprocess the image. + image = Image.open(input_image_path) + image = preprocess_PIL_image(image).numpy() + image = np.transpose(image, (0, 2, 3, 1)) + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_data = inceptionV3.Inference([image]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + # show the Top 5 predictions for image + output = torch.from_numpy(output_data) + probabilities = torch.softmax(output, dim=0) + post_process(probabilities, output) + + +def Release(): + global inceptionV3 + + # Release the resources. + del(inceptionV3) + + +Init() + +Inference("input.jpg", "output.jpg") + +Release() + diff --git a/samples/python/inception_v3/input.jpg b/samples/python/inception_v3/input.jpg new file mode 100644 index 0000000..12f0e0d Binary files /dev/null and b/samples/python/inception_v3/input.jpg differ diff --git a/samples/python/lama_dilated/README.md b/samples/python/lama_dilated/README.md new file mode 100644 index 0000000..1a78f9d --- /dev/null +++ b/samples/python/lama_dilated/README.md @@ -0,0 +1,59 @@ +# lama_dilated Sample Code + +## Introduction +This is sample code for using AppBuilder to load lama_dilated QNN model to HTP and execute inference to erase and in-paint part of given input image. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\lama_dilated\qnn\libQnnHtpV73Skel.so +C:\ai-hub\lama_dilated\qnn\QnnHtp.dll +C:\ai-hub\lama_dilated\qnn\QnnHtpV73Stub.dll +C:\ai-hub\lama_dilated\qnn\QnnSystem.dll +C:\ai-hub\lama_dilated\qnn\libqnnhtpv73.cat +``` + +## lama_dilated QNN models +Download the quantized lama_dilated QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/lama_dilated + +After downloaded the model, copy it to the following path: +``` +"C:\ai-hub\lama_dilated\models\lama_dilated.bin" +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/lama_dilated/lama_dilated.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\lama_dilated\ +``` + +Copy one sample 512x512 image and mask to following path: +``` +C:\ai-hub\lama_dilated\test_input_image.png +C:\ai-hub\lama_dilated\test_input_mask.png +``` + +Run the sample code: +``` +python lama_dilated.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\lama_dilated\out.png +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. 
Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/lama_dilated/input.png b/samples/python/lama_dilated/input.png new file mode 100644 index 0000000..e05e8b1 Binary files /dev/null and b/samples/python/lama_dilated/input.png differ diff --git a/samples/python/lama_dilated/lama_dilated.py b/samples/python/lama_dilated/lama_dilated.py new file mode 100644 index 0000000..5629db2 --- /dev/null +++ b/samples/python/lama_dilated/lama_dilated.py @@ -0,0 +1,122 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import numpy as np +import torch +import torchvision.transforms as transforms + +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from typing import Callable, Dict, List, Tuple + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +image_size = 512 +lamadilated = None +image_buffer = None + + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def preprocess_inputs( + pixel_values_or_image: Image, + mask_pixel_values_or_image: Image, +) -> Dict[str, torch.Tensor]: + + NCHW_fp32_torch_frames = preprocess_PIL_image(pixel_values_or_image) + NCHW_fp32_torch_masks = preprocess_PIL_image(mask_pixel_values_or_image) + + # The number of input images should equal the number of input masks. + if NCHW_fp32_torch_masks.shape[0] != 1: + NCHW_fp32_torch_masks = NCHW_fp32_torch_masks.tile( + (NCHW_fp32_torch_frames.shape[0], 1, 1, 1) + ) + + # Mask input image + image_masked = ( + NCHW_fp32_torch_frames * (1 - NCHW_fp32_torch_masks) + NCHW_fp32_torch_masks + ) + + return {"image": image_masked, "mask": NCHW_fp32_torch_masks} + +# LamaDilated class which inherited from the class QNNContext. +class LamaDilated(QNNContext): + def Inference(self, input_data, input_mask): + input_datas=[input_data, input_mask] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global lamadilated + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for LamaDilated objects. + lamadilated_model = "models\\lama_dilated.bin" + lamadilated = LamaDilated("lamadilated", lamadilated_model) + +def Inference(input_image_path, input_mask_path, output_image_path): + global image_buffer + + # Read and preprocess the image&mask. 
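+    # PIL images -> NCHW float tensors (preprocess_inputs) -> NHWC numpy arrays, the layout listed in the model's input specs.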
+ image = Image.open(input_image_path) + mask = Image.open(input_mask_path) + inputs = preprocess_inputs(image, mask) + image_masked, mask_torch = inputs["image"], inputs["mask"] + image_masked = image_masked.numpy() + mask_torch = mask_torch.numpy() + + image_masked = np.transpose(image_masked, (0, 2, 3, 1)) + mask_torch = np.transpose(mask_torch, (0, 2, 3, 1)) + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_image = lamadilated.Inference([image_masked], [mask_torch]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + # show%save the result + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(image_size, image_size, 3) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + image_buffer = output_image[0] + image_buffer.show() + image_buffer.save(output_image_path) + +def Release(): + global lamadilated + + # Release the resources. + del(lamadilated) + + +Init() + +Inference("input.jpg", "mask.jpg", "output.jpg") + +Release() diff --git a/samples/python/lama_dilated/mask.png b/samples/python/lama_dilated/mask.png new file mode 100644 index 0000000..51864ec Binary files /dev/null and b/samples/python/lama_dilated/mask.png differ diff --git a/samples/python/openpose/README.md b/samples/python/openpose/README.md new file mode 100644 index 0000000..bb1e0d0 --- /dev/null +++ b/samples/python/openpose/README.md @@ -0,0 +1,58 @@ +# openpose Sample Code + +## Introduction +This is sample code for using AppBuilder to load openpose QNN model to HTP and execute inference to pestimate body and hand pose in an image and return location and confidence for each of 19 joints.. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\openpose\qnn\libQnnHtpV73Skel.so +C:\ai-hub\openpose\qnn\QnnHtp.dll +C:\ai-hub\openpose\qnn\QnnHtpV73Stub.dll +C:\ai-hub\openpose\qnn\QnnSystem.dll +C:\ai-hub\openpose\qnn\libqnnhtpv73.cat +``` + +## openpose QNN models +Download the quantized openpose QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/openpose + +After downloaded the model, copy it to the following path: +``` +"C:\ai-hub\openpose\models\openpose.bin" +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/openpose/openpose.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\openpose\ +``` + +Copy one sample image to following path: +``` +C:\ai-hub\openpose\in.jpg +``` + +Run the sample code: +``` +python openpose.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\openpose\out.jpg +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. 
Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/openpose/input.png b/samples/python/openpose/input.png new file mode 100644 index 0000000..10de8cb Binary files /dev/null and b/samples/python/openpose/input.png differ diff --git a/samples/python/openpose/openpose.py b/samples/python/openpose/openpose.py new file mode 100644 index 0000000..6f18cd1 --- /dev/null +++ b/samples/python/openpose/openpose.py @@ -0,0 +1,493 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import numpy as np +import math +import torch +import PIL +import torchvision.transforms as transforms + +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +import torch.nn.functional as F +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from typing import Callable, Dict, List, Tuple +from scipy.ndimage.filters import gaussian_filter + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +openpose = None + + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.PILToTensor()]) # bgr image + img: torch.Tensor = transform(image) # type: ignore + img = img.unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.permute(1, 2, 0).detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def resize_pad(image: torch.Tensor, dst_size: Tuple[int, int]): + """ + Resize and pad image to be shape [..., dst_size[0], dst_size[1]] + + Parameters: + image: (..., H, W) + Image to reshape. + + dst_size: (height, width) + Size to which the image should be reshaped. 
+ + Returns: + rescaled_padded_image: torch.Tensor (..., dst_size[0], dst_size[1]) + scale: scale factor between original image and dst_size image, (w, h) + pad: pixels of padding added to the rescaled image: (left_padding, top_padding) + + Based on https://github.com/zmurez/MediaPipePyTorch/blob/master/blazebase.py + """ + height, width = image.shape[-2:] + dst_frame_height, dst_frame_width = dst_size + + h_ratio = dst_frame_height / height + w_ratio = dst_frame_width / width + scale = min(h_ratio, w_ratio) + if h_ratio < w_ratio: + scale = h_ratio + new_height = dst_frame_height + new_width = math.floor(width * scale) + else: + scale = w_ratio + new_height = math.floor(height * scale) + new_width = dst_frame_width + + new_height = math.floor(height * scale) + new_width = math.floor(width * scale) + pad_h = dst_frame_height - new_height + pad_w = dst_frame_width - new_width + + pad_top = int(pad_h // 2) + pad_bottom = int(pad_h // 2 + pad_h % 2) + pad_left = int(pad_w // 2) + pad_right = int(pad_w // 2 + pad_w % 2) + + rescaled_image = interpolate( + image, size=[int(new_height), int(new_width)], mode="bilinear" + ) + rescaled_padded_image = pad( + rescaled_image, (pad_left, pad_right, pad_top, pad_bottom) + ) + padding = (pad_left, pad_top) + + return rescaled_padded_image, scale, padding + +def undo_resize_pad( + image: torch.Tensor, + orig_size_wh: Tuple[int, int], + scale: float, + padding: Tuple[int, int], +): + """ + Undos the efffect of resize_pad. Instead of scale, the original size + (in order width, height) is provided to prevent an off-by-one size. + """ + width, height = orig_size_wh + + rescaled_image = interpolate(image, scale_factor=1 / scale, mode="bilinear") + + scaled_padding = [int(round(padding[0] / scale)), int(round(padding[1] / scale))] + + cropped_image = rescaled_image[ + ..., + scaled_padding[1] : scaled_padding[1] + height, + scaled_padding[0] : scaled_padding[0] + width, + ] + + return cropped_image + +def pil_resize_pad( + image: Image, dst_size: Tuple[int, int] +) -> Tuple[Image, float, Tuple[int, int]]: + torch_image = preprocess_PIL_image(image) + torch_out_image, scale, padding = resize_pad( + torch_image, + dst_size, + ) + pil_out_image = torch_tensor_to_PIL_image(torch_out_image[0]) + return (pil_out_image, scale, padding) + +def pil_undo_resize_pad( + image: Image, orig_size_wh: Tuple[int, int], scale: float, padding: Tuple[int, int] +) -> Image: + torch_image = preprocess_PIL_image(image) + torch_out_image = undo_resize_pad(torch_image, orig_size_wh, scale, padding) + pil_out_image = torch_tensor_to_PIL_image(torch_out_image[0]) + return pil_out_image + +def getKeypointsFromPredictions( + paf: torch.Tensor, heatmap: torch.Tensor, h, w +) -> Tuple[np.ndarray, np.ndarray]: + # upsample the PAF and heatmap to be the same size as the original image + target_size = (h, w) + upsampled_paf = ( + F.interpolate(paf, size=target_size, mode="bicubic", align_corners=False) + .detach() + .numpy() + ) + heatmap = ( + F.interpolate(heatmap, size=target_size, mode="bicubic", align_corners=False) + .detach() + .numpy() + ) + + # reshape for post processing + heatmap = np.transpose(heatmap.squeeze(), (1, 2, 0)) + paf = np.transpose(upsampled_paf.squeeze(), (1, 2, 0)) + + """ + The following post-processing code comes from the pytorch openpose repo, at + https://github.com/Hzzone/pytorch-openpose/blob/5ee71dc10020403dc3def2bb68f9b77c40337ae2/src/body.py#L67C9-L67C9 + """ + + all_peaks = [] + peak_counter = 0 + thre1 = 0.1 + thre2 = 0.05 + + for part in range(18): + 
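+        # Smooth each of the 18 body-part heatmap channels and keep local maxima above thre1 as candidate keypoints.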
map_ori = heatmap[:, :, part] + one_heatmap = gaussian_filter(map_ori, sigma=3) + + map_left = np.zeros(one_heatmap.shape) + map_left[1:, :] = one_heatmap[:-1, :] + map_right = np.zeros(one_heatmap.shape) + map_right[:-1, :] = one_heatmap[1:, :] + map_up = np.zeros(one_heatmap.shape) + map_up[:, 1:] = one_heatmap[:, :-1] + map_down = np.zeros(one_heatmap.shape) + map_down[:, :-1] = one_heatmap[:, 1:] + + peaks_binary = np.logical_and.reduce( + ( + one_heatmap >= map_left, + one_heatmap >= map_right, + one_heatmap >= map_up, + one_heatmap >= map_down, + one_heatmap > thre1, + ) + ) + peaks = list( + zip(np.nonzero(peaks_binary)[1], np.nonzero(peaks_binary)[0]) + ) # note reverse + peaks_with_score = [x + (map_ori[x[1], x[0]],) for x in peaks] + peak_id = range(peak_counter, peak_counter + len(peaks)) + peaks_with_score_and_id = [ + peaks_with_score[i] + (peak_id[i],) for i in range(len(peak_id)) + ] + + all_peaks.append(peaks_with_score_and_id) + peak_counter += len(peaks) + + # find connection in the specified sequence, center 29 is in the position 15 + limbSeq = [ + [2, 3], + [2, 6], + [3, 4], + [4, 5], + [6, 7], + [7, 8], + [2, 9], + [9, 10], + [10, 11], + [2, 12], + [12, 13], + [13, 14], + [2, 1], + [1, 15], + [15, 17], + [1, 16], + [16, 18], + [3, 17], + [6, 18], + ] + # the middle joints heatmap correpondence + mapIdx = [ + [31, 32], + [39, 40], + [33, 34], + [35, 36], + [41, 42], + [43, 44], + [19, 20], + [21, 22], + [23, 24], + [25, 26], + [27, 28], + [29, 30], + [47, 48], + [49, 50], + [53, 54], + [51, 52], + [55, 56], + [37, 38], + [45, 46], + ] + + connection_all = [] + special_k = [] + mid_num = 10 + + for k in range(len(mapIdx)): + score_mid = paf[:, :, [x - 19 for x in mapIdx[k]]] + candA = all_peaks[limbSeq[k][0] - 1] + candB = all_peaks[limbSeq[k][1] - 1] + nA = len(candA) + nB = len(candB) + indexA, indexB = limbSeq[k] + if nA != 0 and nB != 0: + connection_candidate = [] + for i in range(nA): + for j in range(nB): + vec = np.subtract(candB[j][:2], candA[i][:2]) + norm = math.sqrt(vec[0] * vec[0] + vec[1] * vec[1]) + norm = max(0.001, norm) + vec = np.divide(vec, norm) + + startend = list( + zip( + np.linspace(candA[i][0], candB[j][0], num=mid_num), + np.linspace(candA[i][1], candB[j][1], num=mid_num), + ) + ) + + vec_x = np.array( + [ + score_mid[ + int(round(startend[index][1])), + int(round(startend[index][0])), + 0, + ] + for index in range(len(startend)) + ] + ) + vec_y = np.array( + [ + score_mid[ + int(round(startend[index][1])), + int(round(startend[index][0])), + 1, + ] + for index in range(len(startend)) + ] + ) + + score_midpts = np.multiply(vec_x, vec[0]) + np.multiply( + vec_y, vec[1] + ) + score_with_dist_prior = sum(score_midpts) / len(score_midpts) + min( + 0.5 * h / norm - 1, 0 + ) + criterion1 = len(np.nonzero(score_midpts > thre2)[0]) > 0.8 * len( + score_midpts + ) + criterion2 = score_with_dist_prior > 0 + if criterion1 and criterion2: + connection_candidate.append( + [ + i, + j, + score_with_dist_prior, + score_with_dist_prior + candA[i][2] + candB[j][2], + ] + ) + + connection_candidate = sorted( + connection_candidate, key=lambda x: x[2], reverse=True + ) + connection = np.zeros((0, 5)) + for c in range(len(connection_candidate)): + i, j, s = connection_candidate[c][0:3] + if i not in connection[:, 3] and j not in connection[:, 4]: + connection = np.vstack( + [connection, [candA[i][3], candB[j][3], s, i, j]] + ) + if len(connection) >= min(nA, nB): + break + + connection_all.append(connection) + else: + special_k.append(k) + connection_all.append([]) 
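+ # At this point connection_all[k] holds the accepted connections for limb k, each scored by sampling the PAF along the line between the two candidate keypoints; limbs with no candidates are recorded in special_k.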
+ + # last number in each row is the total parts number of that person + # the second last number in each row is the score of the overall configuration + subset = -1 * np.ones((0, 20)) + candidate = np.array([item for sublist in all_peaks for item in sublist]) + + for k in range(len(mapIdx)): + if k not in special_k: + partAs = connection_all[k][:, 0] + partBs = connection_all[k][:, 1] + indexA, indexB = np.array(limbSeq[k]) - 1 + + for i in range(len(connection_all[k])): # = 1:size(temp,1) + found = 0 + subset_idx = [-1, -1] + for j in range(len(subset)): # 1:size(subset,1): + if subset[j][indexA] == partAs[i] or subset[j][indexB] == partBs[i]: + subset_idx[found] = j + found += 1 + + if found == 1: + j = subset_idx[0] + if subset[j][indexB] != partBs[i]: + subset[j][indexB] = partBs[i] + subset[j][-1] += 1 + subset[j][-2] += ( + candidate[partBs[i].astype(int), 2] + + connection_all[k][i][2] + ) + elif found == 2: # if found 2 and disjoint, merge them + j1, j2 = subset_idx + membership = ( + (subset[j1] >= 0).astype(int) + (subset[j2] >= 0).astype(int) + )[:-2] + if len(np.nonzero(membership == 2)[0]) == 0: # merge + subset[j1][:-2] += subset[j2][:-2] + 1 + subset[j1][-2:] += subset[j2][-2:] + subset[j1][-2] += connection_all[k][i][2] + subset = np.delete(subset, j2, 0) + else: # as like found == 1 + subset[j1][indexB] = partBs[i] + subset[j1][-1] += 1 + subset[j1][-2] += ( + candidate[partBs[i].astype(int), 2] + + connection_all[k][i][2] + ) + + # if find no partA in the subset, create a new subset + elif not found and k < 17: + row = -1 * np.ones(20) + row[indexA] = partAs[i] + row[indexB] = partBs[i] + row[-1] = 2 + row[-2] = ( + sum(candidate[connection_all[k][i, :2].astype(int), 2]) + + connection_all[k][i][2] + ) + subset = np.vstack([subset, row]) + # delete some rows of subset which has few parts occur + deleteIdx = [] + for i in range(len(subset)): + if subset[i][-1] < 4 or subset[i][-2] / subset[i][-1] < 0.4: + deleteIdx.append(i) + subset = np.delete(subset, deleteIdx, axis=0) + + # subset: n*20 array, 0-17 is the index in candidate, 18 is the total score, 19 is the total parts + # candidate: x, y, score, id + return candidate, subset + + +def draw_keypoints(image: Image, keypoints: np.ndarray, radius=1, alpha=1.0): + overlay = image.copy() + draw = PIL.ImageDraw.Draw(overlay) + confidence_threshold = 0.8 + for kp in keypoints: + x, y, v, i = kp + if v > confidence_threshold: + draw.ellipse( + ( + (int(x - radius), int(y - radius)), + (int(x + radius), int(y + radius)), + ), + outline=(0, 255, 0), + fill=(0, 255, 0), + ) + + return PIL.Image.blend(overlay, image, alpha) + +# OpenPose class which inherited from the class QNNContext. +class OpenPose(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas) + return output_data + +def Init(): + global openpose + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for OpnPose objects. + openpose_model = "models\\openpose.bin" + openpose = OpenPose("openpose", openpose_model) + +def Inference(input_image_path, output_image_path): + # Load image + orig_image = Image.open(input_image_path) + image, scale, padding = pil_resize_pad(orig_image, (224, 224)) + + # preprocess + pixel_tensor = preprocess_PIL_image(image).numpy() + pixel_values = np.transpose(pixel_tensor, (0, 2, 3, 1)) + + # Burst the HTP. 
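+ # The BURST performance profile is applied only for this prediction and released right after it.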
+ PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run prediction + model_output = openpose.Inference([pixel_values]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + # postprocess the result + paf = model_output[0] + heatmap = model_output[1] + + paf = paf.reshape(1, 28, 28, 38) + heatmap = heatmap.reshape(1, 28, 28, 19) + + paf_tensor = torch.from_numpy(paf) + heatmap_tensor = torch.from_numpy(heatmap) + + paf_tensor = paf_tensor.permute(0, 3, 1, 2) + heatmap_tensor = heatmap_tensor.permute(0, 3, 1, 2) + + # post process heatmaps and paf to get keypoints + keypoints, subset = getKeypointsFromPredictions( + paf_tensor, heatmap_tensor, pixel_tensor.shape[2], pixel_tensor.shape[3] + ) + + output_image = draw_keypoints(image, keypoints, radius=4, alpha=0.8) + + # Resize / unpad annotated image + pred_image = pil_undo_resize_pad(output_image, orig_image.size, scale, padding) + + # show&save the result + pred_image.save(output_image_path) + pred_image.show() + + +def Release(): + global openpose + + # Release the resources. + del(openpose) + + +Init() + +Inference("input.png", "output.png") + +Release() diff --git a/samples/python/real_esrgan_general_x4v3/README.md b/samples/python/real_esrgan_general_x4v3/README.md new file mode 100644 index 0000000..9fcbe4e --- /dev/null +++ b/samples/python/real_esrgan_general_x4v3/README.md @@ -0,0 +1,79 @@ +# real_esrgan_general_x4v3 Sample Code + +## Introduction +This is sample code for using AppBuilder to load real_esrgan_general_x4v3 QNN model to HTP and execute inference to upscale an image with minimal loss in quality. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\real_esrgan_general_x4v3\qnn\libQnnHtpV73Skel.so +C:\ai-hub\real_esrgan_general_x4v3\qnn\QnnHtp.dll +C:\ai-hub\real_esrgan_general_x4v3\qnn\QnnHtpV73Stub.dll +C:\ai-hub\real_esrgan_general_x4v3\qnn\QnnSystem.dll +C:\ai-hub\real_esrgan_general_x4v3\qnn\libqnnhtpv73.cat +``` + +## real_esrgan_general_x4v3 QNN models +The quantized real_esrgan_general_x4v3 QNN model's input resolution is 128x128 from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/real_esrgan_general_x4v3 + +The input resolution is too small, we'll use AI Hub API to generate 512x512 QNN model. +You can refer to below links on how to setup AI Hub envirinment and how to use AI Hub API: +https://aihub.qualcomm.com/get-started +http://app.aihub.qualcomm.com/docs/ + +a. Download the latest 'ai-hub-models' code and install it to Python environment: +``` +git clone --recursive https://github.com/quic/ai-hub-models.git +pip install -e . +``` + +b. Use below commmand to generate QNN model which suppor 515x512 input resolution: +``` +python -m qai_hub_models.models.real_esrgan_general_x4v3.export --device "Snapdragon X Elite CRD" --target-runtime qnn --height 512 --width 512 +``` + +Part of the above command output as below, you can download the model through the link 'https://app.aihub.qualcomm.com/jobs/j1p86jxog/': +``` +Optimizing model real_esrgan_general_x4v3 to run on-device +Uploading model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64.7M/64.7M [00:21<00:00, 3.15MB/s] +Scheduled compile job (j1p86jxog) successfully. 
To see the status and results: + https://app.aihub.qualcomm.com/jobs/j1p86jxog/ +``` +After downloaded the model, copy it to the following path: +``` +C:\ai-hub\real_esrgan_general_x4v3\models\real_esrgan_general_x4v3_512.bin +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/real_esrgan_general_x4v3/real_esrgan_general_x4v3.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\real_esrgan_general_x4v3\ +``` + +Copy one sample 512x512 image to following path: +``` +C:\ai-hub\real_esrgan_general_x4v3\in.jpg +``` + +Run the sample code: +``` +python real_esrgan_general_x4v3.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\real_esrgan_general_x4v3\out.jpg +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md diff --git a/samples/python/real_esrgan_general_x4v3/input.png b/samples/python/real_esrgan_general_x4v3/input.png new file mode 100644 index 0000000..5e0bd64 Binary files /dev/null and b/samples/python/real_esrgan_general_x4v3/input.png differ diff --git a/samples/python/real_esrgan_general_x4v3/real_esrgan_general_x4v3.py b/samples/python/real_esrgan_general_x4v3/real_esrgan_general_x4v3.py new file mode 100644 index 0000000..acf7ed9 --- /dev/null +++ b/samples/python/real_esrgan_general_x4v3/real_esrgan_general_x4v3.py @@ -0,0 +1,96 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import cv2 +import numpy as np + +import torch +import torchvision.transforms as transforms +from PIL import Image +from PIL.Image import fromarray as ImageFromArray + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +#################################################################### + +execution_ws = os.getcwd() +qnn_dir = execution_ws + "\\qnn" + +image_size = 512 +image_buffer = None +realesrgan = None + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float() / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +# RealESRGan class which inherited from the class QNNContext. +class RealESRGan(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global realesrgan + + # Config AppBuilder environment. 
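+ # QNNConfig.Config() points AppBuilder at the local 'qnn' folder and selects the HTP runtime with warning-level logging and basic profiling.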
+ QNNConfig.Config(qnn_dir, Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for RealESRGan objects. + realesrgan_model = "models\\real_esrgan_general_x4v3_512.bin" + realesrgan = RealESRGan("realesrgan", realesrgan_model) + +def Inference(input_image_path, output_image_path): + global image_buffer + + # Read and preprocess the image. + image = Image.open(input_image_path) + image = preprocess_PIL_image(image).numpy() + image = np.transpose(image, (1, 2, 0)) # CHW -> HWC + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_image = realesrgan.Inference([image]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(image_size * 4, image_size * 4, 3) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + image_buffer = output_image[0] + image_buffer.save(output_image_path) + +def Release(): + global realesrgan + + # Release the resources. + del(realesrgan) + + +Init() + +Inference("input.png", "output.png") + +Release() + diff --git a/samples/python/real_esrgan_x4plus/README.md b/samples/python/real_esrgan_x4plus/README.md new file mode 100644 index 0000000..d167469 --- /dev/null +++ b/samples/python/real_esrgan_x4plus/README.md @@ -0,0 +1,79 @@ +# real_esrgan_x4plus Sample Code + +## Introduction +This is sample code for using AppBuilder to load real_esrgan_x4plus QNN model to HTP and execute inference to generate image. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\real_esrgan_x4plus\qnn\libQnnHtpV73Skel.so +C:\ai-hub\real_esrgan_x4plus\qnn\QnnHtp.dll +C:\ai-hub\real_esrgan_x4plus\qnn\QnnHtpV73Stub.dll +C:\ai-hub\real_esrgan_x4plus\qnn\QnnSystem.dll +C:\ai-hub\real_esrgan_x4plus\qnn\libqnnhtpv73.cat +``` + +## real_esrgan_x4plus QNN models +The quantized real_esrgan_x4plus QNN model's input resolution is 128x128 from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/real_esrgan_x4plus + +The input resolution is too small, we'll use AI Hub API to generate 512x512 QNN model. +You can refer to below links on how to setup AI Hub envirinment and how to use AI Hub API: +https://aihub.qualcomm.com/get-started +http://app.aihub.qualcomm.com/docs/ + +a. Download the latest 'ai-hub-models' code and install it to Python environment: +``` +git clone --recursive https://github.com/quic/ai-hub-models.git +pip install -e . +``` + +b. Use below commmand to generate QNN model which suppor 515x512 input resolution: +``` +python -m qai_hub_models.models.real_esrgan_x4plus.export --device "Snapdragon X Elite CRD" --target-runtime qnn --height 512 --width 512 +``` + +Part of the above command output as below, you can download the model through the link 'https://app.aihub.qualcomm.com/jobs/j1p86jxog/': +``` +Optimizing model real_esrgan_x4plus to run on-device +Uploading model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64.7M/64.7M [00:21<00:00, 3.15MB/s] +Scheduled compile job (j1p86jxog) successfully. 
To see the status and results: + https://app.aihub.qualcomm.com/jobs/j1p86jxog/ +``` +After downloaded the model, copy it to the following path: +``` +C:\ai-hub\real_esrgan_x4plus\models\realesrgan_x4_512.bin +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/real_esrgan_x4plus/real_esrgan_x4plus.py + +After downloaded the sample code, please copy it to the following path: +``` +C:\ai-hub\real_esrgan_x4plus\ +``` + +Copy one sample 512x512 image to following path: +``` +C:\ai-hub\real_esrgan_x4plus\in.jpg +``` + +Run the sample code: +``` +python real_esrgan_x4plus.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\real_esrgan_x4plus\out.jpg +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md diff --git a/samples/python/real_esrgan_x4plus/input.png b/samples/python/real_esrgan_x4plus/input.png new file mode 100644 index 0000000..5e0bd64 Binary files /dev/null and b/samples/python/real_esrgan_x4plus/input.png differ diff --git a/samples/python/real_esrgan_x4plus/real_esrgan_x4plus.py b/samples/python/real_esrgan_x4plus/real_esrgan_x4plus.py new file mode 100644 index 0000000..f09421d --- /dev/null +++ b/samples/python/real_esrgan_x4plus/real_esrgan_x4plus.py @@ -0,0 +1,95 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import cv2 +import numpy as np + +import torch +import torchvision.transforms as transforms +from PIL import Image +from PIL.Image import fromarray as ImageFromArray + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +#################################################################### + +execution_ws = os.getcwd() +qnn_dir = execution_ws + "\\qnn" + +image_size = 512 +image_buffer = None +realesrgan = None + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.Resize(image_size), # bgr image + transforms.CenterCrop(image_size), + transforms.PILToTensor()]) + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.permute(1, 2, 0).detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +# RealESRGan class which inherited from the class QNNContext. +class RealESRGan(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global realesrgan + + # Config AppBuilder environment. + QNNConfig.Config(qnn_dir, Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for RealESRGan objects. 
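+ # The model path is relative to the working directory (C:\ai-hub\real_esrgan_x4plus\ in this guide), so run the script from that folder.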
+ realesrgan_model = "models\\realesrgan_x4_512.bin" + realesrgan = RealESRGan("realesrgan", realesrgan_model) + +def Inference(input_image_path, output_image_path): + global image_buffer + + # Read and preprocess the image. + image = Image.open(input_image_path) + image = preprocess_PIL_image(image).numpy() + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. + output_image = realesrgan.Inference([image]) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + output_image = torch.from_numpy(output_image) + output_image = output_image.reshape(3, image_size * 4, image_size * 4) + output_image = torch.unsqueeze(output_image, 0) + output_image = [torch_tensor_to_PIL_image(img) for img in output_image] + image_buffer = output_image[0] + image_buffer.save(output_image_path) + +def Release(): + global realesrgan + + # Release the resources. + del(realesrgan) + + +Init() + +Inference("input.png", "output.png") + +Release() + diff --git a/samples/python/riffusion/README.md b/samples/python/riffusion/README.md new file mode 100644 index 0000000..d935e00 --- /dev/null +++ b/samples/python/riffusion/README.md @@ -0,0 +1,122 @@ +# Riffusion Sample Code + +## Introduction +This is sample code for using AppBuilder to load Riffusion QNN models to HTP and execute inference. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\Riffusion\qnn\libQnnHtpV73Skel.so +C:\ai-hub\Riffusion\qnn\QnnHtp.dll +C:\ai-hub\Riffusion\qnn\QnnHtpV73Stub.dll +C:\ai-hub\Riffusion\qnn\QnnSystem.dll +C:\ai-hub\Riffusion\qnn\libqnnhtpv73.cat +``` + +## Riffusion QNN models +Download the quantized Riffusion QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/riffusion_quantized + +After downloaded the models, copy them to the following path: +``` +C:\ai-hub\Riffusion\models\riffusion_quantized-textencoder_quantized.bin +C:\ai-hub\Riffusion\models\riffusion_quantized-unet_quantized.bin +C:\ai-hub\Riffusion\models\riffusion_quantized-vaedecoder_quantized.bin +``` + +## time-embedding +In this sample code, we need to use 'time-embedding' data. 
The below code can be used to generate the 'time-embedding' data: +``` +import os +import torch +import numpy as np +from diffusers.models.embeddings import get_timestep_embedding +from diffusers import UNet2DConditionModel +from diffusers import DPMSolverMultistepScheduler + +user_step = 20 +time_embeddings = UNet2DConditionModel.from_pretrained('riffusion/riffusion-model-v1', subfolder='unet', cache_dir='./cache').time_embedding +scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +def get_time_embedding(timestep): + timestep = torch.tensor([timestep]) + t_emb = get_timestep_embedding(timestep, 320, True, 0) + emb = time_embeddings(t_emb).detach().numpy() + return emb + +def gen_time_embedding(): + scheduler.set_timesteps(user_step) + + time_emb_path = ".\\models\\time-embedding_riffusion\\" + str(user_step) + "\\" + os.mkdir(time_emb_path) + for step in range(user_step): + file_path = time_emb_path + str(step) + ".raw" + timestep = get_timestep(step) + time_embedding = get_time_embedding(timestep) + time_embedding.tofile(file_path) + +# Only needs to executed once for generating time enbedding data to app folder. +# Modify 'user_step' to '20', '30', '50' to generate 'time_embedding' for steps - '20', '30', '50'. + +user_step = 20 +gen_time_embedding() + +user_step = 30 +gen_time_embedding() + +user_step = 50 +gen_time_embedding() +``` + +After generated the 'time-embedding' data, please copy them to the following path: +``` +C:\ai-hub\Riffusion\models\time-embedding_riffusion\20 +C:\ai-hub\Riffusion\models\time-embedding_riffusion\30 +C:\ai-hub\Riffusion\models\time-embedding_riffusion\50 +``` + +## CLIP ViT-L/14 model +In this sample code, we need CLIP ViT-L/14 as text encoder. You can download the file below from 'https://huggingface.co/riffusion/riffusion-model-v1/tree/main/tokenizer' and save them to foldet 'clip-vit-large-patch14'. +Rename the files to below: +``` +merges.txt +special_tokens_map.json +tokenizer_config.json +vocab.json +``` + +After downloaded the model, please copy them to the following path: +``` +C:\ai-hub\Riffusion\models\tokenizer_riffusion +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/riffusion/riffusion.py + +After downloaded the sample code, please copy them to the following path: +``` +C:\ai-hub\Riffusion\ +``` + +Run the sample code: +``` +python riffusion.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\Riffusion\images\ +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md diff --git a/samples/python/riffusion/Riffusion.py b/samples/python/riffusion/Riffusion.py new file mode 100644 index 0000000..c2ef751 --- /dev/null +++ b/samples/python/riffusion/Riffusion.py @@ -0,0 +1,272 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. 
+# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import time +from PIL import Image +import os +import shutil +import cv2 +import numpy as np +import torch +from transformers import CLIPTokenizer +from diffusers import DPMSolverMultistepScheduler + +from qai_appbuilder import (QNNContext, QNNContextProc, QNNShareMemory, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig, timer) + +#################################################################### + +execution_ws = os.getcwd() +qnn_dir = execution_ws + "\\qnn" + +#Model pathes. +model_dir = execution_ws + "\\models" +sd_dir = model_dir +clip_dir = model_dir + "\\tokenizer_riffusion\\" +time_embedding_dir = model_dir + "\\time-embedding_riffusion\\" + +tokenizer = None +scheduler = None +tokenizer_max_length = 77 # Define Tokenizer output max length (must be 77) + +# model objects. +text_encoder = None +unet = None +vae_decoder = None + +# Any user defined prompt +user_prompt = "" +uncond_prompt = "" +user_seed = np.int64(0) +user_step = 20 # User defined step value, any integer value in {20, 30, 50} +user_text_guidance = 7.5 # User define text guidance, any float value in [5.0, 15.0] + +#################################################################### + +class TextEncoder(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + + # Output of Text encoder should be of shape (1, 77, 768) + output_data = output_data.reshape((1, 77, 768)) + return output_data + +class Unet(QNNContext): + def Inference(self, input_data_1, input_data_2, input_data_3): + # We need to reshape the array to 1 dimensionality before send it to the network. 'input_data_2' already is 1 dimensionality, so doesn't need to reshape. + input_data_1 = input_data_1.reshape(input_data_1.size) + input_data_3 = input_data_3.reshape(input_data_3.size) + + input_datas=[input_data_1, input_data_2, input_data_3] + output_data = super().Inference(input_datas)[0] + + output_data = output_data.reshape(1, 64, 64, 4) + return output_data + +class VaeDecoder(QNNContext): + def Inference(self, input_data): + input_data = input_data.reshape(input_data.size) + input_datas=[input_data] + + output_data = super().Inference(input_datas)[0] + + return output_data + +#################################################################### + + +def model_initialize(): + global scheduler + global tokenizer + global text_encoder + global unet + global vae_decoder + + result = True + + # model names + model_text_encoder = "text_encoder" + model_unet = "model_unet" + model_vae_decoder = "vae_decoder" + + # models' path. + text_encoder_model = sd_dir + "\\riffusion_quantized-textencoder_quantized.bin" + unet_model = sd_dir + "\\riffusion_quantized-unet_quantized.bin" + vae_decoder_model = sd_dir + "\\riffusion_quantized-vaedecoder_quantized.bin" + + # Instance for TextEncoder + text_encoder = TextEncoder(model_text_encoder, text_encoder_model) + + # Instance for Unet + unet = Unet(model_unet, unet_model) + + # Instance for VaeDecoder + vae_decoder = VaeDecoder(model_vae_decoder, vae_decoder_model) + + # Initializing the Tokenizer + tokenizer = CLIPTokenizer.from_pretrained(clip_dir, local_files_only=True) + + # Scheduler - initializing the Scheduler. 
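+ # This scheduler drives the 20/30/50-step denoising loop in model_execute().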
+ scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + + return result + +def run_tokenizer(prompt): + text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer_max_length, truncation=True) + text_input = np.array(text_input.input_ids, dtype=np.float32) + return text_input + +# These parameters can be configured through GUI 'settings'. +def setup_parameters(prompt, un_prompt, seed, step, text_guidance): + + global user_prompt + global uncond_prompt + global user_seed + global user_step + global user_text_guidance + + user_prompt = prompt + uncond_prompt = un_prompt + user_seed = seed + user_step = step + user_text_guidance = text_guidance + + assert isinstance(user_seed, np.int64) == True, "user_seed should be of type int64" + assert isinstance(user_step, int) == True, "user_step should be of type int" + assert user_step == 20 or user_step == 30 or user_step == 50, "user_step should be either 20, 30 or 50" + assert isinstance(user_text_guidance, float) == True, "user_text_guidance should be of type float" + assert user_text_guidance >= 5.0 and user_text_guidance <= 15.0, "user_text_guidance should be a float from [5.0, 15.0]" + +def run_scheduler(noise_pred_uncond, noise_pred_text, latent_in, timestep): + # Convert all inputs from NHWC to NCHW + noise_pred_uncond = np.transpose(noise_pred_uncond, (0, 3, 1, 2)).copy() + noise_pred_text = np.transpose(noise_pred_text, (0, 3, 1, 2)).copy() + latent_in = np.transpose(latent_in, (0, 3, 1, 2)).copy() + + # Convert all inputs to torch tensors + noise_pred_uncond = torch.from_numpy(noise_pred_uncond) + noise_pred_text = torch.from_numpy(noise_pred_text) + latent_in = torch.from_numpy(latent_in) + + # Merge noise_pred_uncond and noise_pred_text based on user_text_guidance + noise_pred = noise_pred_uncond + user_text_guidance * (noise_pred_text - noise_pred_uncond) + + # Run Scheduler step + latent_out = scheduler.step(noise_pred, timestep, latent_in).prev_sample.numpy() + + # Convert latent_out from NCHW to NHWC + latent_out = np.transpose(latent_out, (0, 2, 3, 1)).copy() + + return latent_out + +# Function to get timesteps +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +# Execute the Stable Diffusion pipeline +def model_execute(callback): + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + scheduler.set_timesteps(user_step) # Setting up user provided time steps for Scheduler + + # Run Tokenizer + cond_tokens = run_tokenizer(user_prompt) + uncond_tokens = run_tokenizer(uncond_prompt) + + # Run Text Encoder on Tokens + uncond_text_embedding = text_encoder.Inference(uncond_tokens) + user_text_embedding = text_encoder.Inference(cond_tokens) + + # Initialize the latent input with random initial latent + random_init_latent = torch.randn((1, 4, 64, 64), generator=torch.manual_seed(user_seed)).numpy() + latent_in = random_init_latent.transpose(0, 2, 3, 1) + + time_emb_path = time_embedding_dir + str(user_step) + "\\" + + # Run the loop for user_step times + for step in range(user_step): + print(f'Step {step} Running...') + + timestep = get_timestep(step) # time_embedding = get_time_embedding(timestep) + file_path = time_emb_path + str(step) + ".raw" + time_embedding = np.fromfile(file_path, dtype=np.float32) + + unconditional_noise_pred = unet.Inference(latent_in, time_embedding, uncond_text_embedding) + conditional_noise_pred = unet.Inference(latent_in, time_embedding, user_text_embedding) + + latent_in = 
run_scheduler(unconditional_noise_pred, conditional_noise_pred, latent_in, timestep) + callback(step) + + # Run VAE + import datetime + now = datetime.datetime.now() + output_image = vae_decoder.Inference(latent_in) + formatted_time = now.strftime("%Y_%m_%d_%H_%M_%S") + + if len(output_image) == 0: + callback(None) + else: + image_size = 512 + if not os.path.exists("images"): + os.mkdir("images") + image_path = "images\\%s_%s_%s.jpg"%(formatted_time, str(user_seed), str(image_size)) + output_image = np.clip(output_image * 255.0, 0.0, 255.0).astype(np.uint8) + output_image = output_image.reshape(image_size, image_size, -1) + Image.fromarray(output_image, mode="RGB").save(image_path) + + callback(image_path) + + PerfProfile.RelPerfProfileGlobal() + +# Release all the models. +def model_destroy(): + global text_encoder + global unet + global vae_decoder + + del(text_encoder) + del(unet) + del(vae_decoder) + +#################################################################### + +def SetQNNConfig(): + QNNConfig.Config(qnn_dir, Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + +#################################################################### + +def modelExecuteCallback(result): + if ((None == result) or isinstance(result, str)): # None == Image generates failed. 'str' == image_path: generated new image path. + if (None == result): + result = "None" + print("modelExecuteCallback result: " + result) + else: + result = (result + 1) * 100 + result = int(result / user_step) + result = str(result) + print("modelExecuteCallback result: " + result) + + +SetQNNConfig() + +model_initialize() + +time_start = time.time() + +user_prompt = "Big white bird near river in high resolution, 4K" +uncond_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry" +user_seed = np.random.randint(low=0, high=9999999999, size=None, dtype=np.int64) +user_step = 20 +user_text_guidance = 7.5 + +setup_parameters(user_prompt, uncond_prompt, user_seed, user_step, user_text_guidance) + +model_execute(modelExecuteCallback) + +time_end = time.time() +print("time consumes for inference {}(s)".format(str(time_end - time_start))) + +model_destroy() diff --git a/samples/python/stable_diffusion_v1_5/README.md b/samples/python/stable_diffusion_v1_5/README.md new file mode 100644 index 0000000..2ab6a06 --- /dev/null +++ b/samples/python/stable_diffusion_v1_5/README.md @@ -0,0 +1,122 @@ +# stable_diffusion_v1_5 Sample Code + +## Introduction +This is sample code for using AppBuilder to load Stable Diffusion 1.5 QNN models to HTP and execute inference to generate image. 
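+
+For orientation, the working directory assembled by the steps below ends up looking roughly like this (the paths are the ones used in the following sections; the 'time-embedding_v1.5' folders are generated later in this guide):
+```
+C:\ai-hub\stable_diffusion_v1_5\
+    stable_diffusion_v1_5.py
+    qnn\                    (QNN SDK libraries)
+    models\
+        stable_diffusion_v1_5_quantized-textencoder_quantized.bin
+        stable_diffusion_v1_5_quantized-unet_quantized.bin
+        stable_diffusion_v1_5_quantized-vaedecoder_quantized.bin
+        clip-vit-large-patch14\
+        time-embedding_v1.5\20, 30, 50
+    images\                 (created at runtime for the generated images)
+```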
+ +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\stable_diffusion_v1_5\qnn\libQnnHtpV73Skel.so +C:\ai-hub\stable_diffusion_v1_5\qnn\QnnHtp.dll +C:\ai-hub\stable_diffusion_v1_5\qnn\QnnHtpV73Stub.dll +C:\ai-hub\stable_diffusion_v1_5\qnn\QnnSystem.dll +C:\ai-hub\stable_diffusion_v1_5\qnn\libqnnhtpv73.cat +``` + +## Stable Diffusion 1.5 QNN models +Download the quantized Stable Diffusion 1.5 QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/stable_diffusion_v1_5_quantized + +After downloaded the models, copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v1_5\models\stable_diffusion_v1_5_quantized-textencoder_quantized.bin +C:\ai-hub\stable_diffusion_v1_5\models\stable_diffusion_v1_5_quantized-unet_quantized.bin +C:\ai-hub\stable_diffusion_v1_5\models\stable_diffusion_v1_5_quantized-vaedecoder_quantized.bin +``` + +## time-embedding +In this sample code, we need to use 'time-embedding' data. The below code can be used to generate the 'time-embedding' data: +``` +import os +import torch +import numpy as np +from diffusers.models.embeddings import get_timestep_embedding +from diffusers import UNet2DConditionModel +from diffusers import DPMSolverMultistepScheduler + +user_step = 20 +time_embeddings = UNet2DConditionModel.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder='unet', cache_dir='./cache').time_embedding +scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +def get_time_embedding(timestep): + timestep = torch.tensor([timestep]) + t_emb = get_timestep_embedding(timestep, 320, True, 0) + emb = time_embeddings(t_emb).detach().numpy() + return emb + +def gen_time_embedding(): + scheduler.set_timesteps(user_step) + + time_emb_path = ".\\models\\time-embedding_v1.5\\" + str(user_step) + "\\" + os.mkdir(time_emb_path) + for step in range(user_step): + file_path = time_emb_path + str(step) + ".raw" + timestep = get_timestep(step) + time_embedding = get_time_embedding(timestep) + time_embedding.tofile(file_path) + +# Only needs to executed once for generating time enbedding data to app folder. +# Modify 'user_step' to '20', '30', '50' to generate 'time_embedding' for steps - '20', '30', '50'. + +user_step = 20 +gen_time_embedding() + +user_step = 30 +gen_time_embedding() + +user_step = 50 +gen_time_embedding() +``` + +After generated the 'time-embedding' data, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v1_5\models\time-embedding_v1.5\20 +C:\ai-hub\stable_diffusion_v1_5\models\time-embedding_v1.5\30 +C:\ai-hub\stable_diffusion_v1_5\models\time-embedding_v1.5\50 +``` + +## CLIP ViT-L/14 model +In this sample code, we need CLIP ViT-L/14 as text encoder. You can download the file below from 'https://huggingface.co/openai/clip-vit-large-patch14/tree/main' and save them to foldet 'clip-vit-large-patch14'. 
+Rename the files to below: +``` +merges.txt +special_tokens_map.json +tokenizer_config.json +vocab.json +``` + +After downloaded the model, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v1_5\models\clip-vit-large-patch14 +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/stable_diffusion_v1_5/stable_diffusion_v1_5.py + +After downloaded the sample code, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v1_5\ +``` + +Run the sample code: +``` +python stable_diffusion_v1_5.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\stable_diffusion_v1_5\images\ +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md diff --git a/samples/python/stable_diffusion_v1_5/stable_diffusion_v1_5.py b/samples/python/stable_diffusion_v1_5/stable_diffusion_v1_5.py new file mode 100644 index 0000000..a6be77f --- /dev/null +++ b/samples/python/stable_diffusion_v1_5/stable_diffusion_v1_5.py @@ -0,0 +1,272 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import time +from PIL import Image +import os +import shutil +import cv2 +import numpy as np +import torch +from transformers import CLIPTokenizer +from diffusers import DPMSolverMultistepScheduler + +from qai_appbuilder import (QNNContext, QNNContextProc, QNNShareMemory, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig, timer) + +#################################################################### + +execution_ws = os.getcwd() +qnn_dir = execution_ws + "\\qnn" + +#Model pathes. +model_dir = execution_ws + "\\models" +sd_dir = model_dir +clip_dir = model_dir + "\\clip-vit-large-patch14\\" +time_embedding_dir = model_dir + "\\time-embedding_v1.5\\" + +tokenizer = None +scheduler = None +tokenizer_max_length = 77 # Define Tokenizer output max length (must be 77) + +# model objects. +text_encoder = None +unet = None +vae_decoder = None + +# Any user defined prompt +user_prompt = "" +uncond_prompt = "" +user_seed = np.int64(0) +user_step = 20 # User defined step value, any integer value in {20, 30, 50} +user_text_guidance = 7.5 # User define text guidance, any float value in [5.0, 15.0] + +#################################################################### + +class TextEncoder(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + + # Output of Text encoder should be of shape (1, 77, 768) + output_data = output_data.reshape((1, 77, 768)) + return output_data + +class Unet(QNNContext): + def Inference(self, input_data_1, input_data_2, input_data_3): + # We need to reshape the array to 1 dimensionality before send it to the network. 'input_data_2' already is 1 dimensionality, so doesn't need to reshape. 
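+ # Here input_data_1 is the NHWC latent, input_data_2 the 1-D time embedding and input_data_3 the text embedding, matching how unet.Inference() is called in model_execute().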
+ input_data_1 = input_data_1.reshape(input_data_1.size) + input_data_3 = input_data_3.reshape(input_data_3.size) + + input_datas=[input_data_1, input_data_2, input_data_3] + output_data = super().Inference(input_datas)[0] + + output_data = output_data.reshape(1, 64, 64, 4) + return output_data + +class VaeDecoder(QNNContext): + def Inference(self, input_data): + input_data = input_data.reshape(input_data.size) + input_datas=[input_data] + + output_data = super().Inference(input_datas)[0] + + return output_data + +#################################################################### + + +def model_initialize(): + global scheduler + global tokenizer + global text_encoder + global unet + global vae_decoder + + result = True + + # model names + model_text_encoder = "text_encoder" + model_unet = "model_unet" + model_vae_decoder = "vae_decoder" + + # models' path. + text_encoder_model = sd_dir + "\\stable_diffusion_v1_5_quantized-textencoder_quantized.bin" + unet_model = sd_dir + "\\stable_diffusion_v1_5_quantized-unet_quantized.bin" + vae_decoder_model = sd_dir + "\\stable_diffusion_v1_5_quantized-vaedecoder_quantized.bin" + + # Instance for TextEncoder + text_encoder = TextEncoder(model_text_encoder, text_encoder_model) + + # Instance for Unet + unet = Unet(model_unet, unet_model) + + # Instance for VaeDecoder + vae_decoder = VaeDecoder(model_vae_decoder, vae_decoder_model) + + # Initializing the Tokenizer + tokenizer = CLIPTokenizer.from_pretrained(clip_dir, local_files_only=True) + + # Scheduler - initializing the Scheduler. + scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + + return result + +def run_tokenizer(prompt): + text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer_max_length, truncation=True) + text_input = np.array(text_input.input_ids, dtype=np.float32) + return text_input + +# These parameters can be configured through GUI 'settings'. 
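+ # setup_parameters() stores the prompt, seed, step count and guidance scale in module globals and asserts they are within the supported ranges (step in {20, 30, 50}, guidance in [5.0, 15.0]).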
+def setup_parameters(prompt, un_prompt, seed, step, text_guidance): + + global user_prompt + global uncond_prompt + global user_seed + global user_step + global user_text_guidance + + user_prompt = prompt + uncond_prompt = un_prompt + user_seed = seed + user_step = step + user_text_guidance = text_guidance + + assert isinstance(user_seed, np.int64) == True, "user_seed should be of type int64" + assert isinstance(user_step, int) == True, "user_step should be of type int" + assert user_step == 20 or user_step == 30 or user_step == 50, "user_step should be either 20, 30 or 50" + assert isinstance(user_text_guidance, float) == True, "user_text_guidance should be of type float" + assert user_text_guidance >= 5.0 and user_text_guidance <= 15.0, "user_text_guidance should be a float from [5.0, 15.0]" + +def run_scheduler(noise_pred_uncond, noise_pred_text, latent_in, timestep): + # Convert all inputs from NHWC to NCHW + noise_pred_uncond = np.transpose(noise_pred_uncond, (0, 3, 1, 2)).copy() + noise_pred_text = np.transpose(noise_pred_text, (0, 3, 1, 2)).copy() + latent_in = np.transpose(latent_in, (0, 3, 1, 2)).copy() + + # Convert all inputs to torch tensors + noise_pred_uncond = torch.from_numpy(noise_pred_uncond) + noise_pred_text = torch.from_numpy(noise_pred_text) + latent_in = torch.from_numpy(latent_in) + + # Merge noise_pred_uncond and noise_pred_text based on user_text_guidance + noise_pred = noise_pred_uncond + user_text_guidance * (noise_pred_text - noise_pred_uncond) + + # Run Scheduler step + latent_out = scheduler.step(noise_pred, timestep, latent_in).prev_sample.numpy() + + # Convert latent_out from NCHW to NHWC + latent_out = np.transpose(latent_out, (0, 2, 3, 1)).copy() + + return latent_out + +# Function to get timesteps +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +# Execute the Stable Diffusion pipeline +def model_execute(callback): + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + scheduler.set_timesteps(user_step) # Setting up user provided time steps for Scheduler + + # Run Tokenizer + cond_tokens = run_tokenizer(user_prompt) + uncond_tokens = run_tokenizer(uncond_prompt) + + # Run Text Encoder on Tokens + uncond_text_embedding = text_encoder.Inference(uncond_tokens) + user_text_embedding = text_encoder.Inference(cond_tokens) + + # Initialize the latent input with random initial latent + random_init_latent = torch.randn((1, 4, 64, 64), generator=torch.manual_seed(user_seed)).numpy() + latent_in = random_init_latent.transpose(0, 2, 3, 1) + + time_emb_path = time_embedding_dir + str(user_step) + "\\" + + # Run the loop for user_step times + for step in range(user_step): + print(f'Step {step} Running...') + + timestep = get_timestep(step) # time_embedding = get_time_embedding(timestep) + file_path = time_emb_path + str(step) + ".raw" + time_embedding = np.fromfile(file_path, dtype=np.float32) + + unconditional_noise_pred = unet.Inference(latent_in, time_embedding, uncond_text_embedding) + conditional_noise_pred = unet.Inference(latent_in, time_embedding, user_text_embedding) + + latent_in = run_scheduler(unconditional_noise_pred, conditional_noise_pred, latent_in, timestep) + callback(step) + + # Run VAE + import datetime + now = datetime.datetime.now() + output_image = vae_decoder.Inference(latent_in) + formatted_time = now.strftime("%Y_%m_%d_%H_%M_%S") + + if len(output_image) == 0: + callback(None) + else: + image_size = 512 + if not os.path.exists("images"): + os.mkdir("images") + image_path = 
"images\\%s_%s_%s.jpg"%(formatted_time, str(user_seed), str(image_size)) + output_image = np.clip(output_image * 255.0, 0.0, 255.0).astype(np.uint8) + output_image = output_image.reshape(image_size, image_size, -1) + Image.fromarray(output_image, mode="RGB").save(image_path) + + callback(image_path) + + PerfProfile.RelPerfProfileGlobal() + +# Release all the models. +def model_destroy(): + global text_encoder + global unet + global vae_decoder + + del(text_encoder) + del(unet) + del(vae_decoder) + +#################################################################### + +def SetQNNConfig(): + QNNConfig.Config(qnn_dir, Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + +#################################################################### + +def modelExecuteCallback(result): + if ((None == result) or isinstance(result, str)): # None == Image generates failed. 'str' == image_path: generated new image path. + if (None == result): + result = "None" + print("modelExecuteCallback result: " + result) + else: + result = (result + 1) * 100 + result = int(result / user_step) + result = str(result) + print("modelExecuteCallback result: " + result) + + +SetQNNConfig() + +model_initialize() + +time_start = time.time() + +user_prompt = "Big white bird near river in high resolution, 4K" +uncond_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry" +user_seed = np.random.randint(low=0, high=9999999999, size=None, dtype=np.int64) +user_step = 20 +user_text_guidance = 7.5 + +setup_parameters(user_prompt, uncond_prompt, user_seed, user_step, user_text_guidance) + +model_execute(modelExecuteCallback) + +time_end = time.time() +print("time consumes for inference {}(s)".format(str(time_end - time_start))) + +model_destroy() diff --git a/samples/python/stable_diffusion_v2_1/README.md b/samples/python/stable_diffusion_v2_1/README.md new file mode 100644 index 0000000..fdaee14 --- /dev/null +++ b/samples/python/stable_diffusion_v2_1/README.md @@ -0,0 +1,122 @@ +# stable_diffusion_v2_1 Sample Code + +## Introduction +This is sample code for using AppBuilder to load Stable Diffusion 2.1 QNN models to HTP and execute inference to generate image. + +## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + +Copy the QNN libraries from QNN SDK to below path: +``` +C:\ai-hub\stable_diffusion_v2_1\qnn\libQnnHtpV73Skel.so +C:\ai-hub\stable_diffusion_v2_1\qnn\QnnHtp.dll +C:\ai-hub\stable_diffusion_v2_1\qnn\QnnHtpV73Stub.dll +C:\ai-hub\stable_diffusion_v2_1\qnn\QnnSystem.dll +C:\ai-hub\stable_diffusion_v2_1\qnn\libqnnhtpv73.cat +``` + +## Stable Diffusion 2.1 QNN models +Download the quantized Stable Diffusion 2.1 QNN models from Qualcomm® AI Hub: +https://aihub.qualcomm.com/compute/models/stable_diffusion_v2_1_quantized + +After downloaded the models, copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v2_1\models\stable_diffusion_v2_1_quantized-textencoder_quantized.bin +C:\ai-hub\stable_diffusion_v2_1\models\stable_diffusion_v2_1_quantized-unet_quantized.bin +C:\ai-hub\stable_diffusion_v2_1\models\stable_diffusion_v2_1_quantized-vaedecoder_quantized.bin +``` + +## time-embedding +In this sample code, we need to use 'time-embedding' data. 
The below code can be used to generate the 'time-embedding' data: +``` +import os +import torch +import numpy as np +from diffusers.models.embeddings import get_timestep_embedding +from diffusers import UNet2DConditionModel +from diffusers import DPMSolverMultistepScheduler + +user_step = 20 +time_embeddings = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-1-base', subfolder='unet', revision="main", cache_dir='./cache').time_embedding +scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +def get_time_embedding(timestep): + timestep = torch.tensor([timestep]) + t_emb = get_timestep_embedding(timestep, 320, True, 0) + emb = time_embeddings(t_emb).detach().numpy() + return emb + +def gen_time_embedding(): + scheduler.set_timesteps(user_step) + + time_emb_path = ".\\models\\time-embedding_v2.1\\" + str(user_step) + "\\" + os.mkdir(time_emb_path) + for step in range(user_step): + file_path = time_emb_path + str(step) + ".raw" + timestep = get_timestep(step) + time_embedding = get_time_embedding(timestep) + time_embedding.tofile(file_path) + +# Only needs to executed once for generating time enbedding data to app folder. +# Modify 'user_step' to '20', '30', '50' to generate 'time_embedding' for steps - '20', '30', '50'. + +user_step = 20 +gen_time_embedding() + +user_step = 30 +gen_time_embedding() + +user_step = 50 +gen_time_embedding() +``` + +After generated the 'time-embedding' data, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v2_1\models\time-embedding_v2.1\20 +C:\ai-hub\stable_diffusion_v2_1\models\time-embedding_v2.1\30 +C:\ai-hub\stable_diffusion_v2_1\models\time-embedding_v2.1\50 +``` + +## CLIP ViT-L/14 model +In this sample code, we need CLIP ViT-L/14 as text encoder. You can download the file below from 'https://huggingface.co/stabilityai/stable-diffusion-2-1-base/tree/main/tokenizer' and save them to foldet 'clip-vit-large-patch14'. +Rename the files to below: +``` +merges.txt +special_tokens_map.json +tokenizer_config.json +vocab.json +``` + +After downloaded the model, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v2_1\models\tokenizer_2.1 +``` + +## Run the sample code +Download the sample code from the following link: +https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/stable_diffusion_v2_1/stable_diffusion_v2_1.py + +After downloaded the sample code, please copy them to the following path: +``` +C:\ai-hub\stable_diffusion_v2_1\ +``` + +Run the sample code: +``` +python stable_diffusion_v2_1.py +``` + +## Output +The output image will be saved to the following path: +``` +C:\ai-hub\stable_diffusion_v2_1\images\ +``` + +## Reference +You need to setup the AppBuilder environment before you run the sample code. Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md diff --git a/samples/python/stable_diffusion_v2_1/stable_diffusion_v2_1.py b/samples/python/stable_diffusion_v2_1/stable_diffusion_v2_1.py new file mode 100644 index 0000000..9d81e63 --- /dev/null +++ b/samples/python/stable_diffusion_v2_1/stable_diffusion_v2_1.py @@ -0,0 +1,264 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. 
All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import time +from PIL import Image +import os +import shutil +import cv2 +import numpy as np +import torch +from transformers import CLIPTokenizer +from diffusers import DPMSolverMultistepScheduler + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig, timer) + +#################################################################### + +execution_ws = os.getcwd() +qnn_dir = execution_ws + "\\qnn" + +#Model pathes. +model_dir = execution_ws + "\\models" +sd_dir = model_dir +clip_dir = model_dir + "\\tokenizer_2.1\\" +time_embedding_dir = model_dir + "\\time-embedding_v2.1\\" + +tokenizer = None +scheduler = None +tokenizer_max_length = 77 # Define Tokenizer output max length (must be 77) + +# model objects. +text_encoder = None +unet = None +vae_decoder = None + +# Any user defined prompt +user_prompt = "" +uncond_prompt = "" +user_seed = np.int64(0) +user_step = 20 # User defined step value, any integer value in {20, 30, 50} +user_text_guidance = 7.5 # User define text guidance, any float value in [5.0, 15.0] + +#################################################################### + +class TextEncoder(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + + # Output of Text encoder should be of shape (1, 77, 1024) + output_data = output_data.reshape((1, 77, 1024)) + return output_data + +class Unet(QNNContext): + def Inference(self, input_data_1, input_data_2, input_data_3): + # We need to reshape the array to 1 dimensionality before send it to the network. 'input_data_2' already is 1 dimensionality, so doesn't need to reshape. + input_data_1 = input_data_1.reshape(input_data_1.size) + input_data_3 = input_data_3.reshape(input_data_3.size) + + input_datas=[input_data_1, input_data_2, input_data_3] + output_data = super().Inference(input_datas)[0] + + output_data = output_data.reshape(1, 4, 64, 64) + return output_data + +class VaeDecoder(QNNContext): + def Inference(self, input_data): + input_data = input_data.reshape(input_data.size) + input_datas=[input_data] + + output_data = super().Inference(input_datas)[0] + + return output_data + +#################################################################### + + +def model_initialize(): + global scheduler + global tokenizer + global text_encoder + global unet + global vae_decoder + + result = True + + # model names + model_text_encoder = "text_encoder" + model_unet = "model_unet" + model_vae_decoder = "vae_decoder" + + # models' path. + text_encoder_model = sd_dir + "\\stable_diffusion_v2_1_quantized-textencoder_quantized.bin" + unet_model = sd_dir + "\\stable_diffusion_v2_1_quantized-unet_quantized.bin" + vae_decoder_model = sd_dir + "\\stable_diffusion_v2_1_quantized-vaedecoder_quantized.bin" + + # Instance for TextEncoder + text_encoder = TextEncoder(model_text_encoder, text_encoder_model) + + # Instance for Unet + unet = Unet(model_unet, unet_model) + + # Instance for VaeDecoder + vae_decoder = VaeDecoder(model_vae_decoder, vae_decoder_model) + + # Initializing the Tokenizer + tokenizer = CLIPTokenizer.from_pretrained(clip_dir, local_files_only=True) + + # Scheduler - initializing the Scheduler. 
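+ # This scheduler drives the denoising loop in model_execute(); note that in the v2.1 pipeline the latents stay in NCHW, so run_scheduler() below performs no layout conversion.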
+ scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear") + + return result + +def run_tokenizer(prompt): + text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer_max_length, truncation=True) + text_input = np.array(text_input.input_ids, dtype=np.float32) + return text_input + +# These parameters can be configured through GUI 'settings'. +def setup_parameters(prompt, un_prompt, seed, step, text_guidance): + + global user_prompt + global uncond_prompt + global user_seed + global user_step + global user_text_guidance + + user_prompt = prompt + uncond_prompt = un_prompt + user_seed = seed + user_step = step + user_text_guidance = text_guidance + + assert isinstance(user_seed, np.int64) == True, "user_seed should be of type int64" + assert isinstance(user_step, int) == True, "user_step should be of type int" + assert user_step == 20 or user_step == 30 or user_step == 50, "user_step should be either 20, 30 or 50" + assert isinstance(user_text_guidance, float) == True, "user_text_guidance should be of type float" + assert user_text_guidance >= 5.0 and user_text_guidance <= 15.0, "user_text_guidance should be a float from [5.0, 15.0]" + +def run_scheduler(noise_pred_uncond, noise_pred_text, latent_in, timestep): + # Convert all inputs to torch tensors + noise_pred_uncond = torch.from_numpy(noise_pred_uncond) + noise_pred_text = torch.from_numpy(noise_pred_text) + latent_in = torch.from_numpy(latent_in) + + # Merge noise_pred_uncond and noise_pred_text based on user_text_guidance + noise_pred = noise_pred_uncond + user_text_guidance * (noise_pred_text - noise_pred_uncond) + + # Run Scheduler step + latent_out = scheduler.step(noise_pred, timestep, latent_in).prev_sample.numpy() + + return latent_out + +# Function to get timesteps +def get_timestep(step): + return np.int32(scheduler.timesteps.numpy()[step]) + +# Execute the Stable Diffusion pipeline +def model_execute(callback): + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + scheduler.set_timesteps(user_step) # Setting up user provided time steps for Scheduler + + # Run Tokenizer + cond_tokens = run_tokenizer(user_prompt) + uncond_tokens = run_tokenizer(uncond_prompt) + + # Run Text Encoder on Tokens + uncond_text_embedding = text_encoder.Inference(uncond_tokens) + user_text_embedding = text_encoder.Inference(cond_tokens) + + # Initialize the latent input with random initial latent + latent_in = torch.randn((1, 4, 64, 64), generator=torch.manual_seed(user_seed)).numpy() + + + time_emb_path = time_embedding_dir + str(user_step) + "\\" + + # Run the loop for user_step times + for step in range(user_step): + print(f'Step {step} Running...') + + timestep = get_timestep(step) # time_embedding = get_time_embedding(timestep) + file_path = time_emb_path + str(step) + ".raw" + time_embedding = np.fromfile(file_path, dtype=np.float32) + + unconditional_noise_pred = unet.Inference(latent_in, time_embedding, uncond_text_embedding) + conditional_noise_pred = unet.Inference(latent_in, time_embedding, user_text_embedding) + + latent_in = run_scheduler(unconditional_noise_pred, conditional_noise_pred, latent_in, timestep) + callback(step) + + # Run VAE + import datetime + now = datetime.datetime.now() + output_image = vae_decoder.Inference(latent_in) + formatted_time = now.strftime("%Y_%m_%d_%H_%M_%S") + + if len(output_image) == 0: + callback(None) + else: + image_size = 512 + if not os.path.exists("images"): + os.mkdir("images") + image_path = 
"images\\%s_%s_%s.jpg"%(formatted_time, str(user_seed), str(image_size)) + output_image = np.clip(output_image * 255.0, 0.0, 255.0).astype(np.uint8) + output_image = output_image.reshape(image_size, image_size, -1) + Image.fromarray(output_image, mode="RGB").save(image_path) + + callback(image_path) + + PerfProfile.RelPerfProfileGlobal() + +# Release all the models. +def model_destroy(): + global text_encoder + global unet + global vae_decoder + + del(text_encoder) + del(unet) + del(vae_decoder) + +#################################################################### + +def SetQNNConfig(): + QNNConfig.Config(qnn_dir, Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + +#################################################################### + +def modelExecuteCallback(result): + if ((None == result) or isinstance(result, str)): # None == Image generates failed. 'str' == image_path: generated new image path. + if (None == result): + result = "None" + print("modelExecuteCallback result: " + result) + else: + result = (result + 1) * 100 + result = int(result / user_step) + result = str(result) + print("modelExecuteCallback result: " + result) + + +SetQNNConfig() + +model_initialize() + +time_start = time.time() + +user_prompt = "Big white bird near river in high resolution, 4K" +uncond_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry" +user_seed = np.random.randint(low=0, high=9999999999, size=None, dtype=np.int64) +user_step = 20 +user_text_guidance = 7.5 + +setup_parameters(user_prompt, uncond_prompt, user_seed, user_step, user_text_guidance) + +model_execute(modelExecuteCallback) + +time_end = time.time() +print("time consumes for inference {}(s)".format(str(time_end - time_start))) + +model_destroy() diff --git a/samples/python/unet_segmentation/README.md b/samples/python/unet_segmentation/README.md new file mode 100644 index 0000000..39f6ad9 --- /dev/null +++ b/samples/python/unet_segmentation/README.md @@ -0,0 +1,58 @@ +# unet_segmentation Sample Code + +## Introduction +This is sample code for using AppBuilder to load unet_segmentation QNN model to HTP and execute inference to produce a segmentation mask for an image. 
+
+## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below:
+https://github.com/quic/ai-engine-direct-helper/blob/main/README.md
+https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md
+
+Copy the QNN libraries from the QNN SDK to the path below:
+```
+C:\ai-hub\unet_segmentation\qnn\libQnnHtpV73Skel.so
+C:\ai-hub\unet_segmentation\qnn\QnnHtp.dll
+C:\ai-hub\unet_segmentation\qnn\QnnHtpV73Stub.dll
+C:\ai-hub\unet_segmentation\qnn\QnnSystem.dll
+C:\ai-hub\unet_segmentation\qnn\libqnnhtpv73.cat
+```
+
+## unet_segmentation QNN model
+Download the quantized unet_segmentation QNN model from Qualcomm® AI Hub:
+https://aihub.qualcomm.com/compute/models/unet_segmentation
+
+After downloading the model, copy it to the following path:
+```
+C:\ai-hub\unet_segmentation\models\unet_segmentation.bin
+```
+
+## Run the sample code
+Download the sample code from the following link:
+https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/unet_segmentation/unet_segmentation.py
+
+After downloading the sample code, copy it to the following path:
+```
+C:\ai-hub\unet_segmentation\
+```
+
+Copy a sample image to the following path:
+```
+C:\ai-hub\unet_segmentation\input.jpg
+```
+
+Run the sample code:
+```
+python unet_segmentation.py
+```
+
+## Output
+The output image will be saved to the following path:
+```
+C:\ai-hub\unet_segmentation\output.jpg
+```
+
+## Reference
+You need to set up the AppBuilder environment before running the sample code. Below is the guide on how to set up the AppBuilder environment:
+https://github.com/quic/ai-engine-direct-helper/blob/main/README.md
+https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md
+
+
diff --git a/samples/python/unet_segmentation/input.jpg b/samples/python/unet_segmentation/input.jpg
new file mode 100644
index 0000000..26d1543
Binary files /dev/null and b/samples/python/unet_segmentation/input.jpg differ
diff --git a/samples/python/unet_segmentation/unet_segmentation.py b/samples/python/unet_segmentation/unet_segmentation.py
new file mode 100644
index 0000000..004b1a6
--- /dev/null
+++ b/samples/python/unet_segmentation/unet_segmentation.py
@@ -0,0 +1,159 @@
+# ---------------------------------------------------------------------
+# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import numpy as np +import math +import torch +import torchvision.transforms as transforms + +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from torch.nn.functional import interpolate, pad +from torchvision import transforms +from typing import Callable, Dict, List, Tuple + +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +unetsegmentation = None + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.PILToTensor()]) # bgr image + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.permute(1, 2, 0).detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def resize_pad(image: torch.Tensor, dst_size: Tuple[int, int]): + """ + Resize and pad image to be shape [..., dst_size[0], dst_size[1]] + + Parameters: + image: (..., H, W) + Image to reshape. + + dst_size: (height, width) + Size to which the image should be reshaped. + + Returns: + rescaled_padded_image: torch.Tensor (..., dst_size[0], dst_size[1]) + scale: scale factor between original image and dst_size image, (w, h) + pad: pixels of padding added to the rescaled image: (left_padding, top_padding) + + Based on https://github.com/zmurez/MediaPipePyTorch/blob/master/blazebase.py + """ + height, width = image.shape[-2:] + dst_frame_height, dst_frame_width = dst_size + + h_ratio = dst_frame_height / height + w_ratio = dst_frame_width / width + scale = min(h_ratio, w_ratio) + if h_ratio < w_ratio: + scale = h_ratio + new_height = dst_frame_height + new_width = math.floor(width * scale) + else: + scale = w_ratio + new_height = math.floor(height * scale) + new_width = dst_frame_width + + new_height = math.floor(height * scale) + new_width = math.floor(width * scale) + pad_h = dst_frame_height - new_height + pad_w = dst_frame_width - new_width + + pad_top = int(pad_h // 2) + pad_bottom = int(pad_h // 2 + pad_h % 2) + pad_left = int(pad_w // 2) + pad_right = int(pad_w // 2 + pad_w % 2) + + rescaled_image = interpolate( + image, size=[int(new_height), int(new_width)], mode="bilinear" + ) + rescaled_padded_image = pad( + rescaled_image, (pad_left, pad_right, pad_top, pad_bottom) + ) + padding = (pad_left, pad_top) + + return rescaled_padded_image, scale, padding + +def pil_resize_pad( + image: Image, dst_size: Tuple[int, int] +) -> Tuple[Image, float, Tuple[int, int]]: + torch_image = preprocess_PIL_image(image) + torch_out_image, scale, padding = resize_pad( + torch_image, + dst_size, + ) + pil_out_image = torch_tensor_to_PIL_image(torch_out_image[0]) + return (pil_out_image, scale, padding) + +# UnetSegmentation class which inherited from the class QNNContext. +class UnetSegmentation(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas)[0] + return output_data + +def Init(): + global unetsegmentation + + # Config AppBuilder environment. 
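+    # The arguments are: the folder holding the QNN backend libraries (the 'qnn'
+    # folder prepared in the README), the runtime backend (HTP, i.e. the NPU),
+    # the log verbosity and the profiling level.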
+    QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC)
+
+    # Instance for UnetSegmentation objects.
+    unetsegmentation_model = "models\\unet_segmentation.bin"
+    unetsegmentation = UnetSegmentation("unetsegmentation", unetsegmentation_model)
+
+def Inference(input_image_path, output_image_path):
+    # Read and preprocess the image.
+    orig_image = Image.open(input_image_path)
+    image, _, _ = pil_resize_pad(orig_image, (640, 1280))
+    img = preprocess_PIL_image(image).numpy()
+    img = np.transpose(img, (0, 2, 3, 1))
+
+    # Burst the HTP.
+    PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST)
+
+    # Run the inference.
+    out = unetsegmentation.Inference([img])
+
+    # Reset the HTP.
+    PerfProfile.RelPerfProfileGlobal()
+
+    # Reshape the output to (N, H, W, C), then convert it to (N, C, H, W).
+    out = out.reshape(1, 640, 1280, 2)
+    output_tensor = torch.from_numpy(out)
+    output_tensor = output_tensor.permute(0, 3, 1, 2)
+
+    # Get the segmentation mask with an argmax over the class dimension.
+    mask = output_tensor.argmax(dim=1)
+
+    # Save and show the result.
+    output_image = Image.fromarray(mask[0].bool().numpy())
+    output_image.save(output_image_path)
+    output_image.show()
+
+
+def Release():
+    global unetsegmentation
+
+    # Release the resources.
+    del(unetsegmentation)
+
+
+Init()
+
+Inference("input.jpg", "output.jpg")
+
+Release()
diff --git a/samples/python/yolov8_det/README.md b/samples/python/yolov8_det/README.md
new file mode 100644
index 0000000..55f2ab2
--- /dev/null
+++ b/samples/python/yolov8_det/README.md
@@ -0,0 +1,58 @@
+# yolov8_det Sample Code
+
+## Introduction
+This is sample code for using AppBuilder to load the yolov8_det QNN model to HTP and execute inference to predict bounding boxes and classes of objects in an image.
+
+## Setup AppBuilder environment and prepare QNN SDK libraries by referring to the links below:
+https://github.com/quic/ai-engine-direct-helper/blob/main/README.md
+https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md
+
+Copy the QNN libraries from the QNN SDK to the path below:
+```
+C:\ai-hub\yolov8_det\qnn\libQnnHtpV73Skel.so
+C:\ai-hub\yolov8_det\qnn\QnnHtp.dll
+C:\ai-hub\yolov8_det\qnn\QnnHtpV73Stub.dll
+C:\ai-hub\yolov8_det\qnn\QnnSystem.dll
+C:\ai-hub\yolov8_det\qnn\libqnnhtpv73.cat
+```
+
+## yolov8_det QNN model
+Download the quantized yolov8_det QNN model from Qualcomm® AI Hub:
+https://aihub.qualcomm.com/compute/models/yolov8_det
+
+After downloading the model, copy it to the following path:
+```
+C:\ai-hub\yolov8_det\models\yolov8_det.bin
+```
+
+## Run the sample code
+Download the sample code from the following link:
+https://github.com/quic/ai-engine-direct-helper/blob/main/Samples/yolov8_det/yolov8_det.py
+
+After downloading the sample code, copy it to the following path:
+```
+C:\ai-hub\yolov8_det\
+```
+
+Copy a sample 640x640 image to the following path:
+```
+C:\ai-hub\yolov8_det\input.jpg
+```
+
+Run the sample code:
+```
+python yolov8_det.py
+```
+
+## Output
+The output image will be saved to the following path:
+```
+C:\ai-hub\yolov8_det\output.jpg
+```
+
+## Reference
+You need to set up the AppBuilder environment before running the sample code.
Below is the guide on how to setup the AppBuilder environment: +https://github.com/quic/ai-engine-direct-helper/blob/main/README.md +https://github.com/quic/ai-engine-direct-helper/blob/main/Docs/User_Guide.md + + diff --git a/samples/python/yolov8_det/input.jpg b/samples/python/yolov8_det/input.jpg new file mode 100644 index 0000000..fcb4613 Binary files /dev/null and b/samples/python/yolov8_det/input.jpg differ diff --git a/samples/python/yolov8_det/yolov8_det.py b/samples/python/yolov8_det/yolov8_det.py new file mode 100644 index 0000000..8cfdb74 --- /dev/null +++ b/samples/python/yolov8_det/yolov8_det.py @@ -0,0 +1,366 @@ +# --------------------------------------------------------------------- +# Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# --------------------------------------------------------------------- + +import os +import cv2 +import numpy as np +import torch +from torch.nn.functional import interpolate, pad +from torchvision.ops import nms # nms from torch is not avaliable +import torchvision.transforms as transforms +from PIL import Image +from PIL.Image import fromarray as ImageFromArray +from typing import List, Tuple, Optional, Union, Callable +from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig) + +nms_score_threshold: float = 0.45 +nms_iou_threshold: float = 0.7 +yolov8 = None + +# define class type +class_map = { + 0: "person", + 1: "bicycle", + 2: "car", + 3: "motorcycle", + 4: "airplane", + 5: "bus", + 6: "train", + 7: "truck", + 8: "boat", + 9: "traffic light", + 10: "fire hydrant", + 11: "stop sign", + 12: "parking meter", + 13: "bench", + 14: "bird", + 15: "cat", + 16: "dog", + 17: "horse", + 18: "sheep", + 19: "cow", + 20: "elephant", + 21: "bear", + 22: "zebra", + 23: "giraffe", + 24: "backpack", + 25: "umbrella", + 26: "handbag", + 27: "tie", + 28: "suitcase", + 29: "frisbee", + 30: "skis", + 31: "snowboard", + 32: "sports ball", + 33: "kite", + 34: "baseball bat", + 35: "baseball glove", + 36: "skateboard", + 37: "surfboard", + 38: "tennis racket", + 39: "bottle", + 40: "wine glass", + 41: "cup", + 42: "fork", + 43: "knife", + 44: "spoon", + 45: "bowl", + 46: "banana", + 47: "apple", + 48: "sandwich", + 49: "orange", + 50: "broccoli", + 51: "carrot", + 52: "hot dog", + 53: "pizza", + 54: "donut", + 55: "cake", + 56: "chair", + 57: "couch", + 58: "potted plant", + 59: "bed", + 60: "dining table", + 61: "toilet", + 62: "tv", + 63: "laptop", + 64: "mouse", + 65: "remote", + 66: "keyboard", + 67: "cell phone", + 68: "microwave", + 69: "oven", + 70: "toaster", + 71: "sink", + 72: "refrigerator", + 73: "book", + 74: "clock", + 75: "vase", + 76: "scissors", + 77: "teddy bear", + 78: "hair drier", + 79: "toothbrush" +} + +def preprocess_PIL_image(image: Image) -> torch.Tensor: + """Convert a PIL image into a pyTorch tensor with range [0, 1] and shape NCHW.""" + transform = transforms.Compose([transforms.PILToTensor()]) # bgr image + img: torch.Tensor = transform(image) # type: ignore + img = img.float().unsqueeze(0) / 255.0 # int 0 - 255 to float 0.0 - 1.0 + return img + +def torch_tensor_to_PIL_image(data: torch.Tensor) -> Image: + """ + Convert a Torch tensor (dtype float32) with range [0, 1] and shape CHW into PIL image CHW + """ + out = torch.clip(data, min=0.0, max=1.0) + np_out = (out.permute(1, 2, 0).detach().numpy() * 255).astype(np.uint8) + return ImageFromArray(np_out) + +def custom_nms(boxes, scores, iou_threshold): + ''' + self definition of 
nms function cause nms from torch is not avaliable on this device without cuda + ''' + + if len(boxes) == 0: + return torch.empty((0,), dtype=torch.int64) + + # transfer to numpy array + boxes_np = boxes.cpu().numpy() + scores_np = scores.cpu().numpy() + + # get the coor of boxes + x1 = boxes_np[:, 0] + y1 = boxes_np[:, 1] + x2 = boxes_np[:, 2] + y2 = boxes_np[:, 3] + + # compute the area of each single boxes + areas = (x2 - x1 + 1) * (y2 - y1 + 1) + order = scores_np.argsort()[::-1] + + keep = [] + while order.size > 0: + i = order[0] + keep.append(i) + xx1 = np.maximum(x1[i], x1[order[1:]]) + yy1 = np.maximum(y1[i], y1[order[1:]]) + xx2 = np.minimum(x2[i], x2[order[1:]]) + yy2 = np.minimum(y2[i], y2[order[1:]]) + + w = np.maximum(0.0, xx2 - xx1 + 1) + h = np.maximum(0.0, yy2 - yy1 + 1) + inter = w * h + ovr = inter / (areas[i] + areas[order[1:]] - inter) + + inds = np.where(ovr <= iou_threshold)[0] + order = order[inds + 1] + + return torch.tensor(keep, dtype=torch.int64) + +def batched_nms( + iou_threshold: float, + score_threshold: float, + boxes: torch.Tensor, + scores: torch.Tensor, + *gather_additional_args, +) -> Tuple[List[torch.Tensor], ...]: + """ + Non maximum suppression over several batches. + + Inputs: + iou_threshold: float + Intersection over union (IoU) threshold + + score_threshold: float + Score threshold (throw away any boxes with scores under this threshold) + + boxes: torch.Tensor + Boxes to run NMS on. Shape is [B, N, 4], B == batch, N == num boxes, and 4 == (x1, x2, y1, y2) + + scores: torch.Tensor + Scores for each box. Shape is [B, N], range is [0:1] + + *gather_additional_args: torch.Tensor, ... + Additional tensor(s) to be gathered in the same way as boxes and scores. + In other words, each arg is returned with only the elements for the boxes selected by NMS. + Should be shape [B, N, ...] + + Outputs: + boxes_out: List[torch.Tensor] + Output boxes. This is list of tensors--one tensor per batch. + Each tensor is shape [S, 4], where S == number of selected boxes, and 4 == (x1, x2, y1, y2) + + boxes_out: List[torch.Tensor] + Output scores. This is list of tensors--one tensor per batch. + Each tensor is shape [S], where S == number of selected boxes. + + *args : List[torch.Tensor], ... + "Gathered" additional arguments, if provided. 
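+
+    Example (mirroring how this sample calls it further below):
+        boxes_out, scores_out, classes_out = batched_nms(
+            nms_iou_threshold, nms_score_threshold,
+            pred_boxes, pred_scores, pred_class_idx,
+        )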
+ """ + scores_out: List[torch.Tensor] = [] + boxes_out: List[torch.Tensor] = [] + args_out: List[List[torch.Tensor]] = ( + [[] for _ in gather_additional_args] if gather_additional_args else [] + ) + + for batch_idx in range(0, boxes.shape[0]): + # Clip outputs to valid scores + batch_scores = scores[batch_idx] + scores_idx = torch.nonzero(scores[batch_idx] >= score_threshold).squeeze(-1) + batch_scores = batch_scores[scores_idx] + batch_boxes = boxes[batch_idx, scores_idx] + batch_args = ( + [arg[batch_idx, scores_idx] for arg in gather_additional_args] + if gather_additional_args + else [] + ) + + if len(batch_scores > 0): + nms_indices = custom_nms(batch_boxes[..., :4], batch_scores, iou_threshold) + batch_boxes = batch_boxes[nms_indices] + batch_scores = batch_scores[nms_indices] + batch_args = [arg[nms_indices] for arg in batch_args] + + boxes_out.append(batch_boxes) + scores_out.append(batch_scores) + for arg_idx, arg in enumerate(batch_args): + args_out[arg_idx].append(arg) + + return boxes_out, scores_out, *args_out + +def draw_box_from_xyxy( + frame: np.ndarray, + top_left: np.ndarray | torch.Tensor | Tuple[int, int], + bottom_right: np.ndarray | torch.Tensor | Tuple[int, int], + color: Tuple[int, int, int] = (0, 0, 0), + size: int = 3, + text: Optional[str] = None, +): + """ + Draw a box using the provided top left / bottom right points to compute the box. + + Parameters: + frame: np.ndarray + np array (H W C x uint8, BGR) + + box: np.ndarray | torch.Tensor + array (4), where layout is + [xc, yc, h, w] + + color: Tuple[int, int, int] + Color of drawn points and connection lines (RGB) + + size: int + Size of drawn points and connection lines BGR channel layout + + text: None | str + Overlay text at the top of the box. + + Returns: + None; modifies frame in place. + """ + if not isinstance(top_left, tuple): + top_left = (int(top_left[0].item()), int(top_left[1].item())) + if not isinstance(bottom_right, tuple): + bottom_right = (int(bottom_right[0].item()), int(bottom_right[1].item())) + cv2.rectangle(frame, top_left, bottom_right, color, size) + if text is not None: + cv2.putText( + frame, + text, + (top_left[0], top_left[1] - 10), + cv2.FONT_HERSHEY_SIMPLEX, + 0.5, + color, + size, + ) + +# YoloV8 class which inherited from the class QNNContext. +class YoloV8(QNNContext): + def Inference(self, input_data): + input_datas=[input_data] + output_data = super().Inference(input_datas) + return output_data + +def Init(): + global yolov8 + + # Config AppBuilder environment. + QNNConfig.Config(os.getcwd() + "\\qnn", Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC) + + # Instance for YoloV8 objects. + yolov8_model = "models\\yolov8_det.bin" + yolov8 = YoloV8("yolov8", yolov8_model) + +def Inference(input_image_path, output_image_path): + global image_buffer, nms_iou_threshold, nms_score_threshold + + # Read and preprocess the image. + image = Image.open(input_image_path) + image = image.resize((640, 640)) + outputImg = Image.open(input_image_path) + outputImg = outputImg.resize((640, 640)) + image = preprocess_PIL_image(image) # transfer raw image to torch tensor format + image = image.permute(0, 2, 3, 1) + image = image.numpy() + + output_image = np.array(outputImg.convert("RGB")) # transfer to numpy array + + # Burst the HTP. + PerfProfile.SetPerfProfileGlobal(PerfProfile.BURST) + + # Run the inference. 
+ model_output = yolov8.Inference([image]) + pred_boxes = torch.tensor(model_output[0].reshape(1, -1, 4)) + pred_scores = torch.tensor(model_output[1].reshape(1, -1)) + pred_class_idx = torch.tensor(model_output[2].reshape(1, -1)) + + # Reset the HTP. + PerfProfile.RelPerfProfileGlobal() + + # Non Maximum Suppression on each batch + pred_boxes, pred_scores, pred_class_idx = batched_nms( + nms_iou_threshold, + nms_score_threshold, + pred_boxes, + pred_scores, + pred_class_idx, + ) + + # Add boxes to each batch + for batch_idx in range(len(pred_boxes)): + pred_boxes_batch = pred_boxes[batch_idx] + pred_scores_batch = pred_scores[batch_idx] + pred_class_idx_batch = pred_class_idx[batch_idx] + for box, score, class_idx in zip(pred_boxes_batch, pred_scores_batch, pred_class_idx_batch): + class_idx_item = class_idx.item() + class_name = class_map.get(class_idx_item, "Unknown") + + draw_box_from_xyxy( + output_image, + box[0:2].int(), + box[2:4].int(), + color=(0, 255, 0), + size=2, + text=f'{score.item():.2f} {class_name}' + ) + + #save and display the output_image + output_image = Image.fromarray(output_image) + output_image.save(output_image_path) + output_image.show() + +def Release(): + global yolov8 + + # Release the resources. + del(yolov8) + + +Init() + +Inference("input.jpg", "output.jpg") + +Release()