Comparison with faster-whisper #1127
Comments
I've just submitted a pull request that aims to address a few of these issues. Currently, OpenBLAS isn't active on the Windows platform, even though the previously released binary file is named whisper-blas-bin-x64. When OpenBLAS is enabled, it boosts CPU inference speed by a factor of 3-4. I ran some tests on my i7-12700H, using the -w 2 flag for matrix multiplication, and found that it achieves at least 50% of the theoretical maximum (with OpenBLAS enabled). |
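To make the "50% of theoretical maximum" figure concrete, here is a minimal back-of-the-envelope sketch (not part of the PR) for turning a measured matmul time into a utilisation percentage. The core count, clock, FLOPs-per-cycle and timing below are illustrative assumptions, not measured specs of the i7-12700H:

#include <cstdio>

int main() {
    // assumed hardware characteristics -- adjust to your CPU (illustrative only)
    const double cores           = 6.0;   // P-cores participating in the benchmark
    const double clock_ghz       = 4.0;   // sustained all-core clock
    const double flops_per_cycle = 32.0;  // AVX2: 2 FMA units x 8 fp32 lanes x 2 ops

    const double peak_gflops = cores * clock_ghz * flops_per_cycle;

    // example measurement: a 512x512x512 sgemm taking 1.2 ms (placeholder value)
    const double m = 512, n = 512, k = 512;
    const double time_s = 1.2e-3;
    const double achieved_gflops = (2.0 * m * n * k) / time_s / 1e9;

    printf("peak:        %.1f GFLOPS\n", peak_gflops);
    printf("achieved:    %.1f GFLOPS\n", achieved_gflops);
    printf("utilisation: %.1f %%\n", 100.0 * achieved_gflops / peak_gflops);
    return 0;
}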
Does this mean we're at 30 seconds compared to faster-whisper's 14 seconds? |
For this example there are 2 main reasons explaining why faster-whisper is faster:
Disclaimer: I'm the author of faster-whisper. |
Just throwing in there that faster-whisper is quicker than whisper.cpp on the GPU as well. Using an RTX 4080 on Ubuntu 22.04, a 12min audio sample takes 3.4min to transcribe using whisper.cpp with a |
@guillaumekln has batched beam search still not been implemented? |
The related issue #1048 is still open so I don't think it is implemented yet. |
Going to take a crack at bringing over the implementation from llama. Fair warning, I am not very experienced with C/C++. Will link the PR here once ready for review. |
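For anyone following along, here is a minimal, self-contained C++ sketch of the idea behind batched beam search (this is not the llama.cpp or whisper.cpp implementation): instead of running the decoder once per beam candidate, the last token of every live beam is packed into a single batch and scored in one forward call per step. The scoring function, vocabulary size and beam width are toy stand-ins for illustration; a real implementation would also keep a per-beam KV cache.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Beam {
    std::vector<int> tokens;
    float logprob = 0.0f;
};

// Stand-in for one *batched* decoder call: given the last token of every beam,
// return per-beam log-probabilities over a toy vocabulary in a single pass.
static std::vector<std::vector<float>> batched_logits(const std::vector<int>& last_tokens, int n_vocab) {
    std::vector<std::vector<float>> out(last_tokens.size(), std::vector<float>(n_vocab));
    for (std::size_t b = 0; b < last_tokens.size(); ++b) {
        for (int v = 0; v < n_vocab; ++v) {
            // deterministic toy scores; a real decoder runs the transformer once over the batch here
            out[b][v] = -std::fabs(std::sin(0.1f * (last_tokens[b] + 1) * (v + 1)));
        }
    }
    return out;
}

int main() {
    const int n_vocab   = 16;
    const int beam_size = 4;
    const int n_steps   = 8;

    std::vector<Beam> beams = { Beam{ {0}, 0.0f } }; // start from a single prompt token

    for (int step = 0; step < n_steps; ++step) {
        // pack the last token of every beam into one batch -> one model call per step
        std::vector<int> last(beams.size());
        for (std::size_t b = 0; b < beams.size(); ++b) last[b] = beams[b].tokens.back();
        const auto logits = batched_logits(last, n_vocab);

        // expand every beam with every candidate token, then keep the best beam_size
        std::vector<Beam> candidates;
        for (std::size_t b = 0; b < beams.size(); ++b) {
            for (int v = 0; v < n_vocab; ++v) {
                Beam nb = beams[b];
                nb.tokens.push_back(v);
                nb.logprob += logits[b][v];
                candidates.push_back(std::move(nb));
            }
        }
        std::partial_sort(candidates.begin(), candidates.begin() + beam_size, candidates.end(),
                          [](const Beam& a, const Beam& b) { return a.logprob > b.logprob; });
        candidates.resize(beam_size);
        beams = std::move(candidates);
    }

    for (const Beam& beam : beams) {
        printf("logprob %.3f :", beam.logprob);
        for (int t : beam.tokens) printf(" %d", t);
        printf("\n");
    }
    return 0;
}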
Could you run another test on the latest version of whisper.cpp? I'm curious to see how much we've improved since last month. You can find the latest version in PR #1243. Thanks! @geekodour Please use OpenBLAS (64-bit). |
Any progress? |
In terms of CPU performance, please note:
Small model on CPU
|
@bobqianic We'd better have batched decoding implemented before additional tests. Without it |
Agree. |
No progress on this yet from me. Will update here with draft PR when I have something. |
I did some measurements of my own that I'd like to share, along with some observations and maybe perf recommendations for this great project.
Audio context size 1500 (default)
Audio context size 512 (seems to strike a good balance between accuracy and performance, see #166)
Legend:
Observations:
For reproduction, I attach the sources of my benchmark as well as the patch to whisper.cpp that was used to expose the MEL step and replace the encode-decode steps. Much code here :)

// Includes added for completeness; the CT2 sections additionally require the CTranslate2
// model headers (e.g. ctranslate2/models/whisper.h) and a `using namespace ctranslate2;`.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>
#include "ggml.h"
#include "whisper.h"

#define AUDIO_CONTEXT_SIZE 1500
#define COMPUTE_TYPE FLOAT32
#define BEAM_SIZE 5
std::string StringifyWhisperCpp(std::vector<int16_t> samples)
{
whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = false;
whisper_context* ctx = whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
auto samples32 = Int16ToFP32(samples);
auto wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.n_threads = 1;
wparams.audio_ctx = AUDIO_CONTEXT_SIZE;
wparams.no_timestamps = true;
wparams.print_special = false;
wparams.token_timestamps = false;
wparams.beam_search.beam_size = BEAM_SIZE;
Benchmark(__func__, [&]
{
if (whisper_full(ctx, wparams, samples32.data(), samples32.size()) != 0)
{
throw std::runtime_error("failed to process audio");
}
return 0;
});
std::string output;
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i)
{
output += whisper_full_get_segment_text(ctx, i);
}
whisper_free(ctx);
return output;
}
#ifdef CT2
std::string StringifyCT2(std::vector<int16_t> samples, bool precomputedMEL = false)
{
std::vector<std::vector<size_t>> prompts{ {50258, 50259, 50359, 50363} };
auto mels = WhisperCppMel(nullptr, samples);
try
{
std::string path = "\\faster-whisper\\models\\base";
auto modelFile = ctranslate2::models::Model::load
(
path,
Device::CPU,
0,
ComputeType::COMPUTE_TYPE
);
auto model = ctranslate2::models::WhisperReplica::create_from_model(*modelFile);
auto featuresSize = precomputedMEL ? (dim_t)mels.size() / 80 : AUDIO_CONTEXT_SIZE * 2;
StorageView features { Shape{ {1, 80, featuresSize} }, DataType::FLOAT32, Device::CPU };
whisper_context* whisper_context = nullptr;
if (precomputedMEL)
{
features.copy_from(mels.data(), mels.size(), Device::CPU, /*synchronous*/ true);
}
else
{
whisper_context = WhisperCppMelContext();
}
auto result = Benchmark(__func__, [&]
{
if (precomputedMEL == false)
{
auto mel = WhisperCppMel(whisper_context, samples);
features.copy_from(mel.data(), mel.size(), Device::CPU, /*synchronous*/ true);
}
return model->generate
(
features,
prompts,
models::WhisperOptions
{
.beam_size = BEAM_SIZE,
.patience = 1,
.length_penalty = 1,
.repetition_penalty = 1.01,
.no_repeat_ngram_size = 0,
.max_length = 448,
.sampling_temperature = 1.0,
.return_scores = false,//true,
.return_no_speech_prob = false,//true,
.max_initial_timestamp_index = 50,
.suppress_blank = false,//true,
.suppress_tokens = {-1},
}
);
});
std::string output = "";
for (auto strings : result[0].sequences)
for (auto string : strings)
{
output += string;
}
if(whisper_context != nullptr)
{
whisper_free(whisper_context);
}
return output;
}
catch (std::exception& e)
{
auto message = e.what();
printf("%s\n", message);
return {}; // don't fall off the end of a std::string-returning function
}
}
std::string StringifyCT2WithCpp(std::vector<int16_t> samples)
{
std::string path = "\\faster-whisper\\models\\base";
auto modelFile = ctranslate2::models::Model::load
(
path,
Device::CPU,
0,
ComputeType::COMPUTE_TYPE
);
auto model = ctranslate2::models::WhisperReplica::create_from_model(*modelFile);
struct CallbackContext
{
models::WhisperReplica* model;
whisper_context* whisper;
std::string stringOut;
std::vector<whisper_token> lastTokens;
StorageView features;
}
callback_context
{
.model = model.get(),
.features = { Shape {{1, 80, AUDIO_CONTEXT_SIZE * 2}}, DataType::FLOAT32, Device::CPU }
};
whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = false;
cparams.custom_encoder.custom_encoder_context = &callback_context;
cparams.custom_encoder.custom_encoder_callback = [](void* custom_encoder_context, ggml_tensor* mel_in, ggml_tensor* encoded_tokens_out)
{
auto& ctx = *(CallbackContext*) custom_encoder_context;
std::vector<std::vector<size_t>> prompts{ {50258, 50259, 50359, 50363} };
assert(mel_in->type == GGML_TYPE_F32);
ctx.features.copy_from((float*)mel_in->data, mel_in->ne[0] * mel_in->ne[1], Device::CPU, /*synchronous*/ true);
auto results = ctx.model->generate
(
ctx.features,
prompts,
models::WhisperOptions
{
.beam_size = BEAM_SIZE,
.patience = 1,
.length_penalty = 1,
.repetition_penalty = 1.01,
.no_repeat_ngram_size = 0,
.max_length = 448,
.sampling_temperature = 1.0,
.return_scores = false,//true,
.return_no_speech_prob = false,//true,
.max_initial_timestamp_index = 50,
.suppress_blank = false,//true,
.suppress_tokens = {-1},
}
);
auto& tokensOut = results[0].sequences_ids;
//assert(encoded_tokens_out->type == GGML_TYPE_I32);
//ggml_backend_tensor_set(encoded_tokens_out, tokensOut.data(), 0, sizeof(size_t) * tokensOut.size());
//for (auto tokens : tokensOut)
//for (auto token : tokens)
//{
// ctx.stringOut += whisper_token_to_str(ctx.whisper, token);
//}
ctx.lastTokens.clear();
for (auto tokens : results[0].sequences_ids)
for (auto token : tokens)
{
ctx.lastTokens.push_back((whisper_token)token);
}
};
cparams.custom_encoder.custom_decoder_context = &callback_context;
cparams.custom_encoder.custom_decoder_callback = []
(
void* custom_decoder_context,
void* submit_tokens_context,
void(*submit_tokens)(void* submit_tokens_context, whisper_token* tokens, size_t tokens_count)
)
{
auto& ctx = *(CallbackContext*) custom_decoder_context;
submit_tokens(submit_tokens_context, ctx.lastTokens.data(), ctx.lastTokens.size());
};
whisper_context* ctx = whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
callback_context.whisper = ctx;
auto samples32 = Int16ToFP32(samples);
auto wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.n_threads = 1;
wparams.audio_ctx = AUDIO_CONTEXT_SIZE;
wparams.no_timestamps = true;
wparams.print_special = false;
wparams.token_timestamps = false;
wparams.beam_search.beam_size = BEAM_SIZE;
Benchmark(__func__, [&]
{
if (whisper_full(ctx, wparams, samples32.data(), samples32.size()) != 0)
{
throw std::runtime_error("failed to process audio");
}
return 0;
});
std::string output;
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i)
{
output += whisper_full_get_segment_text(ctx, i);
}
whisper_free(ctx);
return output;
}
template<class T>
auto Benchmark(const char* name, T&& callback)
{
printf("%s\n", name);
using time = decltype(ggml_time_ms());
// Warmup run
auto firstResult = callback();
std::vector<time> runs;
for (int i = 0; i < 10; ++i)
{
auto begin = ggml_time_ms();
callback();
auto elapsed = ggml_time_ms() - begin;
runs.push_back(elapsed);
}
auto sum = std::accumulate(runs.begin(), runs.end(), time{0});
auto mean = (float)sum / (float)runs.size();
printf("%.2f;", mean);
float variance = 0;
for (auto run : runs)
{
auto diff = run - mean;
variance += diff * diff;
}
auto stdDev = sqrt(variance / (float)(runs.size() - 1));
printf("%.2f;", stdDev);
for (auto run : runs)
{
printf("%i;", (int)run);
}
printf("\n");
return firstResult;
}
std::vector<float> FilterMel(const std::vector<float>& raw_mels, int n_ctx = AUDIO_CONTEXT_SIZE)
{
auto mel_offset = 0;
auto mel_inp_n_mel = 80;
auto mel_inp_n_len = (int)(raw_mels.size() / mel_inp_n_mel);
const int i0 = std::min(mel_offset, mel_inp_n_len);
const int i1 = std::min(mel_offset + 2 * n_ctx, mel_inp_n_len);
std::vector<float> mels;
mels.resize(mel_inp_n_mel * (i1 - i0));
for (int j = 0; j < mel_inp_n_mel; ++j)
{
for (int i = i0; i < i1; ++i)
{
mels[j * 2 * n_ctx + (i - i0)] = raw_mels[j * mel_inp_n_len + i];
}
}
return mels;
}
whisper_context* WhisperCppMelContext()
{
whisper_context_params cparams = whisper_context_default_params();
cparams.use_gpu = false;
return whisper_init_from_file_with_params("\\models\\ggml-base.bin", cparams);
}
std::vector<float> WhisperCppMel(whisper_context* acceptedContext, std::vector<int16_t> samples, int contextSize = AUDIO_CONTEXT_SIZE)
{
whisper_context* ctx = acceptedContext;
if (ctx == nullptr)
{
ctx = WhisperCppMelContext();
}
auto samples32 = Int16ToFP32(samples);
if (whisper_pcm_to_mel(ctx, samples32.data(), samples32.size(), /*n_threads*/1) != 0)
{
fprintf(stderr, "failed to process audio\n");
return {};
}
auto* mel = whisper_extract_mel(ctx);
auto mels = FilterMel(mel->data, contextSize);
if (acceptedContext == nullptr)
{
whisper_free(ctx);
}
return mels;
}
diff --git a/whisper.cpp b/whisper.cpp
index f601197..d39275d 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -351,6 +351,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
{ "yue", { 99, "cantonese", } },
};
+/*
struct whisper_mel {
int n_len;
int n_len_org;
@@ -358,6 +359,7 @@ struct whisper_mel {
std::vector<float> data;
};
+*/
struct whisper_filters {
int32_t n_mel;
@@ -815,6 +817,8 @@ struct whisper_state {
whisper_openvino_context * ctx_openvino = nullptr;
#endif
+ whisper_custom_encoder custom_encoder;
+
// [EXPERIMENTAL] token-level timestamps data
int64_t t_beg = 0;
int64_t t_last = 0;
@@ -1625,7 +1629,7 @@ static bool whisper_encode_external(const whisper_state & wstate) {
const bool use_openvino = wstate.ctx_openvino != nullptr;
#endif
- return use_coreml || use_openvino;
+ return use_coreml || use_openvino || wstate.custom_encoder.custom_encoder_callback != nullptr;
}
static struct ggml_cgraph * whisper_build_graph_conv(
@@ -2059,6 +2063,13 @@ static bool whisper_encode_internal(
whisper_coreml_encode(wstate.ctx_coreml, mel->ne[0], mel->ne[1], (float *) mel->data, (float *) wstate.embd_enc->data);
#elif defined(WHISPER_USE_OPENVINO)
whisper_openvino_encode(wstate.ctx_openvino, mel, wstate.embd_enc);
+#else
+ wstate.custom_encoder.custom_encoder_callback
+ (
+ wstate.custom_encoder.custom_encoder_context,
+ mel,
+ wstate.embd_enc
+ );
#endif
}
}
@@ -2758,7 +2769,7 @@ static void log_mel_spectrogram_worker_thread(int ith, const std::vector<float>
// ref: https://github.com/openai/whisper/blob/main/whisper/audio.py#L110-L157
static bool log_mel_spectrogram(
- whisper_state & wstate,
+ int64_t* time_out,
const float * samples,
const int n_samples,
const int /*sample_rate*/,
@@ -2837,7 +2848,10 @@ static bool log_mel_spectrogram(
mel.data[i] = (mel.data[i] + 4.0)/4.0;
}
- wstate.t_mel_us += ggml_time_us() - t_start_us;
+ if(time_out != nullptr)
+ {
+ *time_out += ggml_time_us() - t_start_us;
+ }
// Dump log_mel_spectrogram
if (debug) {
@@ -2853,6 +2867,37 @@ static bool log_mel_spectrogram(
return true;
}
+static bool log_mel_spectrogram
+(
+ whisper_state& wstate,
+ const float* samples,
+ const int n_samples,
+ const int sample_rate,
+ const int frame_size,
+ const int frame_step,
+ const int n_mel,
+ const int n_threads,
+ const whisper_filters& filters,
+ const bool debug,
+ whisper_mel& mel
+)
+{
+ return log_mel_spectrogram
+ (
+ &wstate.t_mel_us,
+ samples,
+ n_samples,
+ sample_rate,
+ frame_size,
+ frame_step,
+ n_mel,
+ n_threads,
+ filters,
+ debug,
+ mel
+ );
+}
+
// split text into tokens
//
// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
@@ -2970,6 +3015,8 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
whisper_state * state = new whisper_state;
+ state->custom_encoder = ctx->params.custom_encoder;
+
state->backend = whisper_backend_init(ctx->params);
if (!state->backend) {
WHISPER_LOG_ERROR("%s: whisper_backend_init() failed\n", __func__);
@@ -3052,13 +3099,20 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
}
// encoder allocator
- if (!whisper_encode_external(*state)) {
- bool ok = whisper_allocr_graph_init(state->alloc_encode, ctx->backend,
- [&]() {
- return whisper_build_graph_encoder(*ctx, *state);
- });
+ if (!whisper_encode_external(*state))
+ {
+ bool ok = whisper_allocr_graph_init
+ (
+ state->alloc_encode,
+ ctx->backend,
+ [&]()
+ {
+ return whisper_build_graph_encoder(*ctx, *state);
+ }
+ );
- if (!ok) {
+ if (!ok)
+ {
WHISPER_LOG_ERROR("%s: failed to init encoder allocator\n", __func__);
whisper_free_state(state);
return nullptr;
@@ -3400,6 +3454,23 @@ int whisper_pcm_to_mel(struct whisper_context * ctx, const float * samples, int
return whisper_pcm_to_mel_with_state(ctx, ctx->state, samples, n_samples, n_threads);
}
+int whisper_pcm_to_mel_no_state
+(
+ const float* samples_in,
+ int n_samples,
+ float* mel_out,
+ size_t* mel_size_in_out,
+ int n_threads
+)
+{
+ //if (!log_mel_spectrogram(*state, samples, n_samples, WHISPER_SAMPLE_RATE, WHISPER_N_FFT, WHISPER_HOP_LENGTH, ctx->model.filters.n_mel, n_threads, ctx->model.filters, false, state->mel)) {
+ WHISPER_LOG_ERROR("%s: failed to compute mel spectrogram\n", __func__);
+ return -1;
+// }
+//
+// return 0;
+}
+
// same as whisper_pcm_to_mel, but applies a Phase Vocoder to speed up the audio x2 (PV without phase lock is not good)
int whisper_pcm_to_mel_phase_vocoder_with_state(struct whisper_context * ctx, struct whisper_state * state, const float * samples, int n_samples, int n_threads) {
if (!log_mel_spectrogram(*state, samples, n_samples, WHISPER_SAMPLE_RATE, 2 * WHISPER_N_FFT, 2 * WHISPER_HOP_LENGTH, ctx->model.filters.n_mel, n_threads, ctx->model.filters, false, state->mel)) {
@@ -3453,6 +3524,11 @@ int whisper_set_mel(
return whisper_set_mel_with_state(ctx, ctx->state, data, n_len, n_mel);
}
+whisper_mel* whisper_extract_mel(struct whisper_context* ctx)
+{
+ return &ctx->state->mel;
+}
+
int whisper_encode_with_state(struct whisper_context * ctx, struct whisper_state * state, int offset, int n_threads) {
if (!whisper_encode_internal(*ctx, *state, offset, n_threads, nullptr, nullptr)) {
WHISPER_LOG_ERROR("%s: failed to eval\n", __func__);
@@ -5181,6 +5257,52 @@ int whisper_full_with_state(
int best_decoder_id = 0;
+ //if (state->custom_encoder.custom_encoder_callback != nullptr)
+ //{
+ // seek += 100 * WHISPER_CHUNK_SIZE;
+ // continue;
+ //}
+
+ if (state->custom_encoder.custom_decoder_callback != nullptr)
+ {
+ auto& decoder = state->decoders[best_decoder_id];
+
+ // TAGS: WHISPER_DECODER_INIT
+ decoder.sequence.tokens.clear();
+ decoder.sequence.result_len = 0;
+ decoder.sequence.sum_logprobs_all = 0.0;
+ decoder.sequence.sum_logprobs = -INFINITY;
+ decoder.sequence.avg_logprobs = -INFINITY;
+ decoder.sequence.entropy = 0.0;
+ decoder.sequence.score = -INFINITY;
+
+ decoder.seek_delta = 100 * WHISPER_CHUNK_SIZE;
+
+ decoder.failed = false;
+ decoder.completed = false;
+ decoder.has_ts = false;
+
+ if (params.grammar_rules != nullptr)
+ {
+ decoder.grammar = whisper_grammar_init(params.grammar_rules, params.n_grammar_rules, params.i_start_rule);
+ }
+ else
+ {
+ decoder.grammar = {};
+ }
+
+ state->custom_encoder.custom_decoder_callback
+ (
+ state->custom_encoder.custom_decoder_context,
+ &decoder,
+ [](void* context, whisper_token* tokens, size_t tokens_count)
+ {
+ auto& decoderTokens = ((whisper_decoder*)context)->sequence.tokens;
+ decoderTokens.insert(decoderTokens.begin(), tokens, tokens + tokens_count);
+ }
+ );
+ }
+ else // TODO: Needs formatting, but I don't want to blow patch
for (int it = 0; it < (int) temperatures.size(); ++it) {
const float t_cur = temperatures[it];
@@ -5685,13 +5807,16 @@ int whisper_full_with_state(
//WHISPER_LOG_DEBUG("prompt_init.size() = %d, prompt.size() = %d, result_len = %d, seek_delta = %d\n", prompt_init.size(), prompt.size(), result_len, seek_delta);
// update prompt_past
- prompt_past.clear();
- if (prompt.front() == whisper_token_prev(ctx)) {
- prompt_past.insert(prompt_past.end(), prompt.begin() + 1, prompt.end() - prompt_init.size());
- }
-
- for (int i = 0; i < result_len; ++i) {
- prompt_past.push_back(tokens_cur[i].id);
+ if(state->custom_encoder.custom_decoder_callback == nullptr)
+ {
+ prompt_past.clear();
+ if (prompt.front() == whisper_token_prev(ctx)) {
+ prompt_past.insert(prompt_past.end(), prompt.begin() + 1, prompt.end() - prompt_init.size());
+ }
+
+ for (int i = 0; i < result_len; ++i) {
+ prompt_past.push_back(tokens_cur[i].id);
+ }
}
if (!tokens_cur.empty() && ctx->model.n_loaded > 0) {
diff --git a/whisper.h b/whisper.h
index a5371eb..708dbad 100644
--- a/whisper.h
+++ b/whisper.h
@@ -6,6 +6,7 @@
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
+#include <vector>
#ifdef __GNUC__
# define WHISPER_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
@@ -34,6 +35,8 @@
#define WHISPER_HOP_LENGTH 160
#define WHISPER_CHUNK_SIZE 30
+#define WHISPER_USE_CT2
+
#ifdef __cplusplus
extern "C" {
#endif
@@ -84,9 +87,30 @@ extern "C" {
typedef int32_t whisper_token;
typedef int32_t whisper_seq_id;
+ struct whisper_custom_encoder
+ {
+ void* custom_encoder_context;
+ void(*custom_encoder_callback)
+ (
+ void* custom_encoder_context,
+ ggml_tensor* mel_in,
+ ggml_tensor* encoded_tokens_out
+ );
+
+ void* custom_decoder_context;
+ void(*custom_decoder_callback)
+ (
+ void* custom_decoder_context,
+ void* submit_tokens_context,
+ void(*submit_tokens)(void* submit_tokens_context, whisper_token* tokens, size_t tokens_count)
+ );
+ };
+
struct whisper_context_params {
bool use_gpu;
int gpu_device; // CUDA device
+
+ whisper_custom_encoder custom_encoder;
};
typedef struct whisper_token_data {
@@ -217,6 +241,15 @@ extern "C" {
int n_samples,
int n_threads);
+ WHISPER_API int whisper_pcm_to_mel_no_state
+ (
+ const float* samples_in,
+ int n_samples,
+ float* mel_out,
+ size_t* mel_size_in_out,
+ int n_threads
+ );
+
WHISPER_API int whisper_pcm_to_mel_with_state(
struct whisper_context * ctx,
struct whisper_state * state,
@@ -257,6 +290,16 @@ extern "C" {
int n_len,
int n_mel);
+ struct whisper_mel {
+ int n_len;
+ int n_len_org;
+ int n_mel;
+
+ std::vector<float> data;
+ };
+
+ WHISPER_API whisper_mel* whisper_extract_mel(struct whisper_context* ctx);
+
// Run the Whisper encoder on the log mel spectrogram stored inside the default state in the provided whisper context.
// Make sure to call whisper_pcm_to_mel() or whisper_set_mel() first.
// offset can be used to specify the offset of the first frame in the spectrogram.
Disclaimers:
|
That's FP32. I believe if you give the OpenBLAS version a try, you'll find its performance quite similar to CT2, with hardly any noticeable difference. (v1.5.4) |
Great, I completely missed that in the docs, thanks. So for the BLAS backend I added measurements #27 and #28 to the tables above. I also noticed in the ggml code that there is some support for MKL; I tried to measure it, but it throws an exception on me and I'm not sure why.
Full context of locals
|
It looks like the build compiled with OpenBLAS is actually slower on a Raspberry Pi 5 (1743.21 ms vs. 6232.27 ms). I ran the jfk sample a few times; while the numbers differed slightly, the overall result was the same. |
For me the issue is different: with whisper.cpp and the medium model, the model choked and made a mess, and every few minutes I had to restart it. Horrible results. I therefore think that whisper.cpp is badly flawed somewhere, but I don't know where or how. For reference I used this version: https://github.com/Purfview/whisper-standalone-win/releases/download/Faster-Whisper-XXL/Faster-Whisper-XXL_r194.5_windows.7z |
I am currently researching https://github.com/Sharrnah/whispering configs with Whisper Large V3 |
I did a very rough comparison of https://github.com/guillaumekln/faster-whisper and whisper.cpp, and it turns out faster-whisper is faster than whisper.cpp on CPU.
For example, it takes faster-whisper 14 seconds with the small.en model, whereas with whisper.cpp it takes 46 seconds. What causes this slowness? Or am I not setting the parameters correctly? I tried keeping the beam size and thread count similar. I have a suspicion that I am not doing the comparison correctly; it would be awesome if someone more knowledgeable could explain why faster-whisper is faster on CPU.
I think I am comparing int8 (faster-whisper) to int4 (https://huggingface.co/ggerganov/whisper.cpp) quantization here, but I'm not sure how much of a difference that should make.
See the comparison here:
https://gist.github.com/geekodour/8734b3bf22b8ede61fb5bfc92ce68fe3
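For reference, this is roughly what I mean by keeping the settings comparable on the whisper.cpp side — a minimal sketch using the C API from this repo, with a placeholder model path and parameter values (beam size 5, 4 threads) assumed to mirror the faster-whisper run:

#include "whisper.h"
#include <string>
#include <vector>

// Transcribe 16 kHz mono float PCM with beam size and thread count pinned, so the run is
// comparable to a faster-whisper call using beam_size=5 and cpu_threads=4 (assumed values).
std::string transcribe(const std::vector<float>& pcm_f32) {
    // placeholder model path; a quantized ggml model from HF could be used here instead
    struct whisper_context* ctx = whisper_init_from_file("models/ggml-small.en.bin");
    if (ctx == nullptr) {
        return "";
    }

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.n_threads             = 4; // match faster-whisper cpu_threads
    wparams.beam_search.beam_size = 5; // match faster-whisper beam_size

    std::string text;
    if (whisper_full(ctx, wparams, pcm_f32.data(), (int) pcm_f32.size()) == 0) {
        const int n_segments = whisper_full_n_segments(ctx);
        for (int i = 0; i < n_segments; ++i) {
            text += whisper_full_get_segment_text(ctx, i);
        }
    }
    whisper_free(ctx);
    return text;
}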