Removing Thai from profiles.sm and then detecting Thai text yields Japanese with high probability #78

Open
GoogleCodeExporter opened this issue May 28, 2015 · 1 comment

@GoogleCodeExporter

Apologies for ignoring the convention here and writing in Japanese.

What steps will reproduce the problem?
1. Remove th (Thai) from profiles.sm.
2. Run language detection on Thai text using the profiles.sm from step 1.

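What is the expected output? What do you see instead?
I expected an exception with the message "no features in text", but instead I got the probability
   [ja:0.999999...]
If ja is also removed from profiles.sm, ko gets a similarly high value, and only after ko is removed as well does "no features in text" finally appear.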

What version of the product are you using? On what operating system?
Both the library and profiles.sm are from the HEAD of master (Rev. a1b65d981fc4).

Please provide any additional information below.
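The attached file is source code that reproduces the problem. It can be run with
    ./gradle run

It runs detection on the title and description of a Thai-language YouTube video (http://youtu.be/FwyND40c3pw), and the results are:
    Title:       [ja:0.9999965012903201]
    Description: [ja:0.9999983886531987]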

Sorry if this is the intended behavior.

Original issue reported on code.google.com by mshiban...@gmail.com on 12 May 2015 at 11:27

Attachments:

@dennis97519

Probably because the Japanese and Korean short message profiles contain stray Thai characters that are used in kaomoji? When I open the profiles as plain text, there are also stray Arabic characters and so on.
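
To make that guess concrete, here is a minimal, self-contained sketch of how a few leaked characters could produce a near-certain result. It is a toy model and not the library's actual code: the class `StrayCharDemo`, the `UNSEEN` floor, and every frequency in `demoProfiles()` are invented for illustration, and it scores single characters rather than the n-grams the library uses. It assumes that characters unknown to every profile are skipped, and that the scores of the remaining languages are normalized to sum to 1.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class StrayCharDemo {
    // Hypothetical floor for a character that a given profile has never seen.
    static final double UNSEEN = 1e-9;

    // Toy profiles: language -> (character -> relative frequency). All values are invented.
    static Map<String, Map<Character, Double>> demoProfiles() {
        Map<String, Map<Character, Double>> profiles = new LinkedHashMap<>();
        profiles.put("ja", Map.of('の', 0.05, 'ง', 2e-5)); // Thai 'ง' leaked in via kaomoji
        profiles.put("ko", Map.of('의', 0.05, 'ง', 1e-6)); // same leak, but rarer
        profiles.put("en", Map.of('e', 0.12));             // no Thai characters at all
        return profiles;
    }

    // Naive Bayes over characters with a uniform prior, normalized across the loaded profiles.
    static Map<String, Double> posterior(String text, Map<String, Map<Character, Double>> profiles) {
        Map<String, Double> logScore = new LinkedHashMap<>();
        for (String lang : profiles.keySet()) {
            logScore.put(lang, 0.0);
        }
        int features = 0;
        for (char c : text.toCharArray()) {
            boolean known = false;
            for (Map<Character, Double> freq : profiles.values()) {
                if (freq.containsKey(c)) {
                    known = true;
                }
            }
            if (!known) {
                continue; // a character unknown to every profile carries no evidence
            }
            features++;
            for (String lang : profiles.keySet()) {
                double f = profiles.get(lang).getOrDefault(c, UNSEEN);
                logScore.merge(lang, Math.log(f), Double::sum);
            }
        }
        if (features == 0) {
            throw new IllegalStateException("no features in text");
        }
        // Subtract the maximum before exponentiating to avoid underflow, then normalize.
        double max = Collections.max(logScore.values());
        double total = 0.0;
        for (double s : logScore.values()) {
            total += Math.exp(s - max);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logScore.entrySet()) {
            result.put(e.getKey(), Math.exp(e.getValue() - max) / total);
        }
        return result;
    }

    public static void main(String[] args) {
        String thai = "ง่วงนอน"; // contains 'ง' twice; no other character is in any toy profile
        Map<String, Map<Character, Double>> profiles = demoProfiles();
        System.out.println(posterior(thai, profiles)); // ja dominates
        profiles.remove("ja");
        System.out.println(posterior(thai, profiles)); // ko takes over
        profiles.remove("ko");
        try {
            posterior(thai, profiles);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // no profile knows any character
        }
    }
}
```

In this toy setup the only evidence is the leaked 'ง', so whichever remaining profile has seen it most often takes nearly all of the normalized probability. That is the same ja, then ko, then "no features in text" sequence described in the report, though whether the real profiles behave this way would need checking against their contents.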

Maybe try optimaize's language-detector, which is derived from this one, since Shuyo doesn't come here to update this project very often.

Also, maybe try looking at the probability for Thai as well. If I remember correctly, there is a function that lists the probabilities?

My Japanese isn't good, so I answered in English. Sorry.
