Simplified Chinese rather than traditional? #277
Replies: 10 comments 22 replies
-
The tiny model doesn't perform well on many non-English languages. I'd recommend using the medium or large model for better performance in Chinese. Regarding the simplified/traditional distinction, we used a single language code (zh) for all variants of Chinese. You may be able to nudge the model toward one or the other, for example by prompting in a certain style (i.e. …
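A minimal sketch of the prompting workaround, assuming the whisper command-line tool is installed. It only builds the argument list; it does not run the model. The Simplified prompt is the one quoted later in this thread. The Traditional prompt is simply the same sentence written in Traditional characters, which is an assumption here. The helper name build_whisper_command and the file name talk.mp3 are hypothetical.

```python
import shlex

# Prompts written in the target script. The Traditional variant is an
# assumption: it is the Simplified sentence rewritten in Traditional characters.
SCRIPT_PROMPTS = {
    "simplified": "以下是普通话的句子。",
    "traditional": "以下是普通話的句子。",
}


def build_whisper_command(audio_path, script, model="medium"):
    """Return the argv list for a whisper CLI call nudged toward one script."""
    if script not in SCRIPT_PROMPTS:
        raise ValueError(f"unknown script: {script!r}")
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", "zh",  # one code covers every Chinese variant
        "--initial_prompt", SCRIPT_PROMPTS[script],
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for a hypothetical input file.
    print(shlex.join(build_whisper_command("talk.mp3", "traditional")))
```

The prompt only biases the decoder toward a writing style. It is a nudge, not a switch, so the output should still be checked.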
-
I find it very strange that the initial prompt is the fix for Simplified vs. Traditional Chinese.
I think the correct way to handle Traditional/Simplified would be to use --language with one of the regional codes [zh-hk, zh-cn, zh-sg, zh-tw], which combine the ISO 639-1 language code zh with a region code. It's not just a small variation. Several distinct Traditional characters are often written as the same Simplified character, so once Simplified is output, it cannot be converted back to Traditional character by character. The right choice depends on the context of the sentence.
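To make the one-to-many problem concrete, here is a small sketch that lists every Traditional spelling a character-by-character converter could produce. The table is a tiny hand-picked sample, not a complete mapping, and the function name traditional_candidates is hypothetical.

```python
from itertools import product

# Hand-picked sample: one Simplified character can stand for several
# Traditional characters with different meanings.
S2T_CANDIDATES = {
    "发": ["發", "髮"],  # 發 = to emit/develop, 髮 = hair
    "后": ["後", "后"],  # 後 = after/behind, 后 = empress
    "面": ["面", "麵"],  # 面 = face/surface, 麵 = noodles/flour
    "头": ["頭"],
    "现": ["現"],
}


def traditional_candidates(text):
    """Return all Traditional spellings reachable by per-character lookup."""
    options = [S2T_CANDIDATES.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    # "hair" has two candidates, and only context says 頭髮 is the right one.
    print(traditional_candidates("头发"))
    print(traditional_candidates("发现"))
```

Choosing among the candidates needs word-level or sentence-level context, which is why a plain character table is not enough.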
-
The language option accepts both zh and Chinese. As vopbs said, "Regarding the Simplified/Traditional distinction, we use a single language code, zh, for all Chinese variants." So what is the difference between zh and Chinese? And it is quite smart. I use
-
Got it. I think adding the prompt is a good solution.
On Sat, Aug 5, 2023 at 12:52, Francoyy ***@***.***> wrote:
@BackMountainDevil I was not talking about the difference between "zh" and "Chinese", but about the fact that there are both "Simplified Chinese" and "Traditional Chinese".
Their full notation is usually "zh-TW" (or "zh-HK") for Traditional Chinese, whereas Simplified Chinese is "zh-CN" or, often, just "zh".
The spoken language is the same, but it is written with different characters. Generally, Traditional characters have more strokes. For example, "turtle" is 龟 in Simplified Chinese and 龜 in Traditional.
But as @no1xsyzy mentioned, it's not a 1:1 mapping but an n:m mapping. So once Whisper outputs Chinese text, there's no way for a simple script to convert it automatically from Simplified to Traditional, or vice versa.
Hence the question: is there some way to tell Whisper whether we want Simplified or Traditional output?
-
It is advisable to label Cantonese as "yue" and Mandarin as "cmn", in accordance with their respective ISO 639-3 language codes. Continuing to use "zh" for everything risks conflating the different Chinese languages.
-
Yeah, too bad… because it's just a hack, not a guarantee. Most people don't understand what Simplified Chinese and Traditional Chinese are… I guess the ultimate solution for us would be a script that uses Google Translate to convert between Traditional and Simplified after Whisper is done generating the subtitles…
On Mon, Dec 11, 2023 at 11:03, Kearney ***@***.***> wrote:
… guys. --initial_prompt "以下是普通话的句子。" did not work when tested on large-v3. I wanted zh-cn (Simplified Chinese) but got Traditional Chinese. Do you see the same weird thing?
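A rough sketch of the post-processing idea: run a converter over the subtitle text after Whisper has finished, leaving cue numbers and timestamps untouched. It assumes SRT output. The converter here is a toy Traditional-to-Simplified character table for illustration only. A real script would pass in a proper converter (for example a dictionary-based tool such as OpenCC, or a translation service) as the convert argument. The names convert_srt and toy_t2s are hypothetical.

```python
import re

# Matches an SRT timing line such as "00:00:00,000 --> 00:00:02,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

# Toy Traditional -> Simplified table, for illustration only.
T2S_SAMPLE = str.maketrans({"龜": "龟", "話": "话", "烏": "乌", "說": "说"})


def toy_t2s(text):
    """Stand-in converter; swap in a real one for actual use."""
    return text.translate(T2S_SAMPLE)


def convert_srt(srt_text, convert):
    """Apply `convert` to subtitle text lines only.

    Blank lines, cue numbers, and timing lines pass through unchanged.
    Caveat: a subtitle line made only of digits is treated as a cue number.
    """
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMESTAMP.match(stripped):
            out.append(line)
        else:
            out.append(convert(line))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1\n00:00:00,000 --> 00:00:02,000\n烏龜說話\n"
    print(convert_srt(sample, toy_t2s))
```

Converting Traditional to Simplified is the easier direction, since it is mostly many-to-one. The reverse direction runs into the ambiguity discussed earlier in the thread.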
-
Unfortunately, no. In terms of audio, it's the same. People have trouble understanding that.
How can I explain it simply? It's a little like choosing to write in lowercase or in UPPER CASE.
In mainland China, people write with Simplified Chinese characters, but in Hong Kong and Taiwan, for example, they use Traditional Chinese characters.
So for the same video, if my audience were in Taiwan, I would want Traditional characters as output. But if my audience is in mainland China, I want Simplified characters.
…On Mon, Feb 5, 2024 at 3:11 PM FarmerL ***@***.***> wrote:
Is it possible to separate "Simplified Chinese" and "Traditional Chinese" into two sets of languages?
-
Yes, that makes sense to me. That was the original request. But according to the devs, that's not how Whisper works. Hence the workaround of providing a sample input to influence the output...
…On Mon, Feb 5, 2024 at 3:32 PM jchenny7 ***@***.***> wrote:
How about specifying in the language option whether to output zh-CN (Simplified Chinese) or zh-TW (Traditional Chinese), instead of consolidating both as zh?
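Since Whisper itself only accepts zh, a caller can approximate the zh-CN / zh-TW request with a thin wrapper that turns a regional tag into the language code plus a script-specific prompt. This is a sketch of that idea, not Whisper behavior: the function name transcribe_options, the region grouping, and the Traditional prompt text are all assumptions, and the prompt remains a nudge rather than a guarantee.

```python
# Subtags treated as "Traditional" in this sketch; everything else,
# including bare "zh", falls back to Simplified.
TRADITIONAL_SUBTAGS = {"TW", "HK", "MO", "HANT"}

PROMPTS = {
    "simplified": "以下是普通话的句子。",
    "traditional": "以下是普通話的句子。",  # assumed Traditional counterpart
}


def transcribe_options(tag):
    """Map a tag such as 'zh-TW' to keyword arguments for a transcribe call."""
    subtags = tag.replace("_", "-").split("-")
    if subtags[0].lower() != "zh":
        raise ValueError(f"not a Chinese tag: {tag!r}")
    rest = {s.upper() for s in subtags[1:]}
    script = "traditional" if rest & TRADITIONAL_SUBTAGS else "simplified"
    return {"language": "zh", "initial_prompt": PROMPTS[script]}


if __name__ == "__main__":
    print(transcribe_options("zh-TW"))
    # With the openai-whisper package installed, usage might look like:
    #   model = whisper.load_model("medium")
    #   result = model.transcribe("talk.mp3", **transcribe_options("zh-TW"))
```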
-
I encountered a similar issue where Whisper kept outputting simplified Chinese despite setting the
I suspect that the Whisper model is just too effective at detecting Chinese accents. ;)
-
This is not very accurate.