CLD2 cannot classify text that doesn't have spaces #61

Open
choket opened this issue Jun 18, 2019 · 0 comments

The following text is correctly detected as English, with a percentage of 99:

the soldier with the green whiskers led them through the streets of the emerald city until they reached the room where the guardian of the gates lived this office run locked their spectacles to put them back in his great box and then he

However, the same text with the spaces removed:

thesoldierwiththegreenwhiskersledthemthroughthestreetsoftheemeraldcityuntiltheyreachedtheroomwheretheguardianofthegateslivedthisofficerunlockedtheirspectaclestoputthembackinhisgreatboxandthenhe

is classified as English (because that is the default), but is_reliable is set to false and the percentage is 0.
On further inspection with the function DetectLanguageSummary, the 3 most likely languages are all UNKNOWN_LANGUAGE, and their percentages are all 0.

Since CLD2 uses quadgrams to analyze Latin-script text, whitespace should matter very little (if at all) when detecting the language.
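
To illustrate the expectation, here is a rough Python sketch that compares plain character quadgrams of a spaced sample with those of the same sample after the spaces are removed. It makes no claim about how CLD2 actually tokenizes or scores text; the `quadgrams` helper and the shortened sample string are assumptions made only for this illustration.

```python
def quadgrams(s):
    """Return the set of all 4-character substrings of s (hypothetical helper)."""
    return {s[i:i + 4] for i in range(len(s) - 3)}


# Shortened sample taken from the text in this report.
spaced = "the soldier with the green whiskers led them through the streets"
unspaced = spaced.replace(" ", "")

# Quadgrams that lie entirely inside a single word of the spaced text.
within_words = set()
for word in spaced.split():
    within_words |= quadgrams(word)

# Quadgrams of the text once the spaces are gone.
unspaced_grams = quadgrams(unspaced)

# Every within-word quadgram is still present in the unspaced text;
# removing spaces only adds quadgrams that straddle former word boundaries.
survived = within_words & unspaced_grams
print(len(within_words), len(survived))
```

Under this naive model, all within-word quadgrams are still available after the spaces are removed, which is why a result of UNKNOWN_LANGUAGE with 0 percent looks surprising.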
