Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] New language detection #874

Closed
wants to merge 3 commits into from
Closed

Conversation

djukicn
Copy link
Collaborator

@djukicn djukicn commented Jun 30, 2022

Issue

Implements the new approach to language detection in the add-on.
Fixes #583

Description of changes
  • The function for language detection and dict of supported languages (all languages supported by any method in Addon)
  • Adoption of corpus to store language setting. Corpus can have the language set to the iso code of the language or None (for languages that are not in the list of supported languages - language-dependent methods do not work, but all other methods work)
  • Adopting "input" widgets to set the language of the Corpus
  • Adopting widgets which uses language setting (removing language controls, handling not supported languages)
  • Removed Stafornford POS-tagging method since we do not use it and it is deprecated in NLTK
Includes
  • Code changes
  • Tests
  • Documentation
Comments to the reviewer
  • In the Create Corpus widget user must select a language manually
  • Twitter widget use language set by the user; if language is not set and all tweets have the same language, it is set as Corpus'es language. When tweets are in different languages, language is set to None. I also changed the Twitter widget that variables are initialized when Corpus is built. Before, the same variable instance was used multiple times, and those settings were shared (values, ....).
  • UDPipe offers various variants of models for the same language. Should we still support those variants or use the basic one for each language? Are they used in practice?

@djukicn djukicn marked this pull request as draft June 30, 2022 08:16
@PrimozGodec PrimozGodec changed the title New language detection [ENH] New language detection Jun 30, 2022
@PrimozGodec PrimozGodec force-pushed the langdetect branch 2 times, most recently from 4b93fb2 to 3c3de95 Compare July 20, 2022 15:00
@codecov-commenter
Copy link

codecov-commenter commented Jul 20, 2022

Codecov Report

Merging #874 (e061cd7) into master (23da347) will decrease coverage by 0.73%.
The diff coverage is 89.97%.

❗ Current head e061cd7 differs from pull request most recent head 501c135. Consider uploading reports for the commit 501c135 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #874      +/-   ##
==========================================
- Coverage   77.82%   77.10%   -0.73%     
==========================================
  Files          87       86       -1     
  Lines       12338    12014     -324     
  Branches     1624     1570      -54     
==========================================
- Hits         9602     9263     -339     
- Misses       2434     2459      +25     
+ Partials      302      292      -10     

@PrimozGodec PrimozGodec force-pushed the langdetect branch 4 times, most recently from 718562b to 610a756 Compare July 26, 2022 10:59
@PrimozGodec PrimozGodec force-pushed the langdetect branch 19 times, most recently from 0315e11 to 534b2cc Compare August 19, 2022 14:15
@ajdapretnar
Copy link
Collaborator

I agree with

In place of a language dropdown, we can add a checkbox Use default stop words or something similar

@ajdapretnar
Copy link
Collaborator

I think it is not an error caused by this PR.

You are right, I cannot reproduce this anymore. Ignore the comment.

Copy link
Contributor

@VesnaT VesnaT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some strange behaviour regarding corpus.language property and Corpus ContexHandler:

  1. saved setting probably should not overwrite the input corpus.language (widget Corpus (1))
  2. the corpus.language has been changed globally (widget Python Script (6))
    image

@@ -75,13 +76,28 @@ class StopwordsFilter(BaseTokenFilter, FileWordListMixin):
""" Remove tokens present in NLTK's language specific lists or a file. """
name = 'Stopwords'

# nltk uses different language nams for some languages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nams -> names

@@ -0,0 +1 @@
language: en
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new line is missing at the end of the file. The same goes for all *.tab.metadata files.

@@ -478,7 +485,7 @@ def copy(self):

@staticmethod
def from_documents(documents, name, attributes=None, class_vars=None, metas=None,
title_indices=None):
title_indices=None, language=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Language is missing in the docstring.

@@ -61,7 +61,7 @@ def __init__(self, *args, **kwargs):

# Language
row += 1
language_edit = ComboBox(self, 'language', tuple(sorted(lang2code.items())))
language_edit = ComboBox(self, 'language', tuple(sorted(LANG2ISO.items())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing the Wikipedia widget on canvas results in an error:


Traceback (most recent call last):
  File "/Users/vesna/orange-canvas-core/orangecanvas/scheme/widgetmanager.py", line 236, in __add_widget_for_node
    w = self.create_widget_for_node(node)
  File "/Users/vesna/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 300, in create_widget_for_node
    widget = self.create_widget_instance(node)
  File "/Users/vesna/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 413, in create_widget_instance
    widget.__init__()
  File "/Users/vesna/orange3-text/orangecontrib/text/widgets/owwikipedia.py", line 64, in __init__
    language_edit = ComboBox(self, 'language', tuple(sorted(LANG2ISO.items())))
TypeError: '<' not supported between instances of 'NoneType' and 'str'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set smart defaults with language detection
5 participants