[s2t] add whisper asr large model (#2640)

* add whisper asr large model decoding, test=asr
* fix code style.
* fix json code style.
* remove resource and fix code style.
* fix yapf
* add cli and demos, fix some code style.
* fix some problem by comment.
* fix yapf
PaddlePaddle · Nov 18, 2022 · b1d3f59
1 parent dc9d3ba
commit b1d3f59
Showing 16 changed files with 2,789 additions and 3 deletions.
diff --git a/demos/whisper/README.md b/demos/whisper/README.md
@@ -0,0 +1,89 @@
+([简体中文](./README_cn.md)|English)
+
+## Introduction
+Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
+
+The Whisper model was trained by OpenAI; see https://github.com/openai/whisper.
+
+## Usage
+ ### 1. Installation
+ See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
+
+ You can install paddlespeech using the easy, medium, or hard method.
+
+ ### 2. Prepare Input File
+ The input of this demo should be a WAV file (`.wav`) whose sample rate matches the model's sample rate (16000 Hz by default).
+
+ A sample file for this demo can be downloaded:
+ ```bash
+ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+ ```
+
+ ### 3. Usage
+ - Command Line (Recommended)
+   ```bash
+   # Transcribe speech to text
+   paddlespeech whisper --task transcribe --input ./zh.wav
+
+   # Transcribe speech and translate it to English
+   paddlespeech whisper --task translate --input ./zh.wav
+   ```
+
+   Usage:
+   ```bash
+   paddlespeech whisper --help
+   ```
+   Arguments:
+   - `input` (required): Audio file to recognize.
+   - `model`: Model type of the ASR task. Default: `whisper-large`.
+   - `task`: Output type. Default: `transcribe`.
+   - `lang`: Language of the audio. Default: `None`, in which case the model determines the language itself. Set this to force a specific language.
+   - `sample_rate`: Sample rate of the model. Default: `16000`. Other sample rates are not currently supported.
+   - `config`: Config of the ASR task. The pretrained model's config is used when this is `None`. Default: `None`.
+   - `ckpt_path`: Model checkpoint. The pretrained model is used when this is `None`. Default: `None`.
+   - `yes`: Flag that takes no value. When set, every request from the program is accepted automatically, including the request to convert the audio sample rate. Default: `False`.
+   - `device`: Device on which to run model inference. Default: the default device of paddlepaddle in the current environment.
+   - `verbose`: Show the log information.
+
+
+ - Python API
+   ```python
+   import paddle
+   from paddlespeech.cli.whisper import WhisperExecutor
+
+   whisper_executor = WhisperExecutor()
+
+   # Transcribe speech to text
+   text = whisper_executor(
+       model='whisper-large',
+       task='transcribe',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Transcribe Result: \n{}'.format(text))
+
+   # Transcribe speech and translate it to English
+   translation = whisper_executor(
+       model='whisper-large',
+       task='translate',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Translate Result: \n{}'.format(translation))
+   ```
+
+   Output:
+   ```bash
+   Transcribe Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
+   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
+
+   Translate Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
+   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
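
The result dictionary in the sample output above carries a `segments` list whose entries have `start`, `end`, and `text` fields. A minimal sketch, assuming that shape also holds for other inputs, of rendering those segments as the `[MM:SS.mmm --> MM:SS.mmm]` lines shown in the output. `format_timestamp` and `render_segments` are hypothetical helper names and are not part of PaddleSpeech.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm (hypothetical helper, not a PaddleSpeech API)."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_segments(result: Dict[str, Any]) -> List[str]:
    """Build one '[start --> end] text' line per segment of a result dict."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text']}")
    return lines


# Trimmed copy of the transcription result shown in the README output.
sample = {
    "text": "我认为跑步最重要的就是给我带来了身体健康",
    "segments": [{"id": 0, "start": 0.0, "end": 5.0,
                  "text": "我认为跑步最重要的就是给我带来了身体健康"}],
    "language": "zh",
}

for line in render_segments(sample):
    print(line)
```

Working from the dictionary rather than the printed lines keeps the timing values as numbers, which is convenient if the segments later need to be written out as subtitles.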
diff --git a/demos/whisper/README_cn.md b/demos/whisper/README_cn.md
@@ -0,0 +1,91 @@
+(简体中文|[English](./README.md))
+
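+# Whisper Model
+## Introduction
+Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
+
+The Whisper model was trained by OpenAI; see https://github.com/openai/whisper.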
+
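+## Usage
+### 1. Installation
+ See the [installation guide](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
+
+ You can install paddlespeech using the easy, medium, or hard method.
+
+### 2. Prepare Input File
+ The input of this demo should be a WAV file (`.wav`) whose sample rate matches the model's sample rate.
+
+ A sample audio file for this demo can be downloaded: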
+ ```bash
+ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+ ```
+
+### 3. Usage
+ - Command Line (Recommended)
+   ```bash
+
+   # Transcribe speech to text
+   paddlespeech whisper --task transcribe --input ./zh.wav
+
+   # Translate speech to English
+   paddlespeech whisper --task translate --input ./zh.wav
+   ```
+   Usage:
+   ```bash
+   paddlespeech whisper --help
+   ```
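+   Arguments:
+   - `input` (required): Audio file to recognize.
+   - `model`: Model of the ASR task. Default: `whisper-large`.
+   - `task`: Output type. Default: `transcribe`.
+   - `lang`: Model language. Default: `None`, in which case the model determines the language itself. Set this to force the recognized language.
+   - `sample_rate`: Audio sample rate. Default: `16000`. Whisper does not currently support other sample rates.
+   - `config`: Config file of the ASR task. If not set, the default config of the pretrained model is used. Default: `None`.
+   - `ckpt_path`: Model checkpoint file. If not set, the model is downloaded and used for decoding. Default: `None`.
+   - `yes`: Flag that takes no value. When set, you accept every request from the program by default, including automatic conversion of the input audio's sample rate. Default: `False`.
+   - `device`: Device on which to run inference. Default: the default device of paddlepaddle on the current system.
+   - `verbose`: If set, show logger information.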
+
+
+- Python API
+   ```python
+   import paddle
+   from paddlespeech.cli.whisper import WhisperExecutor
+
+   whisper_executor = WhisperExecutor()
+
+   # Transcribe speech to text
+   text = whisper_executor(
+       model='whisper-large',
+       task='transcribe',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Transcribe Result: \n{}'.format(text))
+
+   # Translate speech to English
+   translation = whisper_executor(
+       model='whisper-large',
+       task='translate',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Translate Result: \n{}'.format(translation))
+   ```
+
+
+   Output:
+   ```bash
+   Transcribe Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
+   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
+
+   Translate Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
+   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
diff --git a/demos/whisper/run.sh b/demos/whisper/run.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+# audio download
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+
+# to recognize text 
+paddlespeech whisper --task transcribe --input ./zh.wav
+
+# to recognize text and translate to English
+paddlespeech whisper --task translate --input ./zh.wav
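
The README in this commit requires the input WAV sample rate to match the model's (16000 by default). A minimal sketch, using only Python's standard `wave` module, of checking a file before running the script above. `check_sample_rate` and `write_silence` are hypothetical helpers, not part of PaddleSpeech, and `wave` only reads uncompressed PCM WAV files, so other encodings are outside what this sketch covers.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # default `sample_rate` documented in the README


def check_sample_rate(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> bool:
    """Return True when the PCM WAV file at `path` uses the expected sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected


def write_silence(path: str, sample_rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV; used only to demonstrate the check."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


# Demonstrate the check on two generated files instead of a downloaded one.
with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "ok.wav")
    bad_path = os.path.join(tmp_dir, "bad.wav")
    write_silence(ok_path, 16000)
    write_silence(bad_path, 8000)
    print(check_sample_rate(ok_path))
    print(check_sample_rate(bad_path))
```

For the demo audio, the same check would be `check_sample_rate('./zh.wav')` after the `wget` step.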
diff --git a/paddlespeech/cli/base_commands.py b/paddlespeech/cli/base_commands.py
@@ -83,7 +83,8 @@ def execute(self, argv: List[str]) -> bool:
     'st': 'Model-Source language-Target language',
     'text': 'Model-Task-Language',
     'tts': 'Model-Language',
-    'vector': 'Model-Sample Rate'
+    'vector': 'Model-Sample Rate',
+    'whisper': 'Model-Language-Sample Rate'
 }
 
 
@@ -94,7 +95,9 @@ class StatsCommand:
     def __init__(self):
         self.parser = argparse.ArgumentParser(
             prog='paddlespeech.stats', add_help=True)
-        self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws']
+        self.task_choices = [
+            'asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws', 'whisper'
+        ]
         self.parser.add_argument(
             '--task',
             type=str,
@@ -141,6 +144,10 @@ def execute(self, argv: List[str]) -> bool:
     'tts': ['Text to Speech infer command.', 'TTSExecutor'],
     'vector': ['Speech to vector embedding infer command.', 'VectorExecutor'],
     'kws': ['Keyword Spotting infer command.', 'KWSExecutor'],
+    'whisper': [
+        'Whisper model for speech to text or translate speech to English.',
+        'WhisperExecutor'
+    ]
 }
 
 for com, info in _commands.items():

diff --git a/paddlespeech/cli/whisper/__init__.py b/paddlespeech/cli/whisper/__init__.py
@@ -0,0 +1,14 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .infer import WhisperExecutor