[ENH] New language detection #874

djukicn · 2022-06-30T08:16:23Z

Issue

Implements the new approach to language detection in the add-on.
Fixes #583

Description of changes

Adds a function for language detection and a dict of supported languages (all languages supported by any method in the add-on).
Adapts Corpus to store the language setting. The Corpus language can be set to the ISO code of the language or to None (for languages that are not in the list of supported languages; language-dependent methods then do not work, but all other methods do).
Adapts the "input" widgets to set the language of the Corpus.
Adapts the widgets that use the language setting (removing language controls, handling unsupported languages).
Removes the Stanford POS-tagging method, since we do not use it and it is deprecated in NLTK.
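
The "ISO code or None" rule above can be sketched as follows. This is only an illustration: `SUPPORTED_LANGUAGES`, `resolve_language` and `can_run_language_dependent` are hypothetical names and are not taken from the add-on's code.

```python
from typing import Optional

# Hypothetical subset of supported languages (ISO 639-1 code -> name).
SUPPORTED_LANGUAGES = {"en": "English", "sl": "Slovenian", "de": "German"}


def resolve_language(code: Optional[str]) -> Optional[str]:
    """Return the normalized ISO code if it is supported, otherwise None."""
    if code is None:
        return None
    code = code.strip().lower()
    return code if code in SUPPORTED_LANGUAGES else None


def can_run_language_dependent(code: Optional[str]) -> bool:
    """Language-dependent methods need a supported language."""
    return resolve_language(code) is not None
```

With this shape, a widget that depends on the language can check `can_run_language_dependent(corpus_language)` and report an unsupported language instead of failing, while language-independent widgets ignore the value.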

Includes

Code changes
Tests
Documentation

Comments to the reviewer

In the Create Corpus widget, the user must select a language manually.
The Twitter widget uses the language set by the user; if the language is not set and all tweets have the same language, it is set as the Corpus's language. When tweets are in different languages, the language is set to None. I also changed the Twitter widget so that variables are initialized when the Corpus is built. Before, the same variable instance was used multiple times, and its settings (values, ...) were shared.
UDPipe offers several model variants for the same language. Should we still support those variants or use the basic one for each language? Are they used in practice?
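
The Twitter rule described above could look roughly like this. It is a sketch under assumptions: `corpus_language` is a hypothetical helper, and the per-tweet languages are assumed to be available as plain ISO codes.

```python
from typing import Iterable, Optional


def corpus_language(tweet_languages: Iterable[Optional[str]],
                    user_language: Optional[str] = None) -> Optional[str]:
    """Pick a corpus language following the rule described above."""
    # A language chosen by the user always wins.
    if user_language is not None:
        return user_language
    distinct = set(tweet_languages)
    # Exactly one language across all tweets: use it.
    if len(distinct) == 1:
        return next(iter(distinct))
    # Mixed languages (or no tweets at all): leave the language unset.
    return None
```

For example, `corpus_language(["en", "en"])` gives `"en"`, while `corpus_language(["en", "sl"])` gives `None`.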

codecov-commenter · 2022-07-20T15:01:51Z

Codecov Report

Merging #874 (e061cd7) into master (23da347) will decrease coverage by 0.73%.
The diff coverage is 89.97%.

❗ Current head e061cd7 differs from pull request most recent head 501c135. Consider uploading reports for the commit 501c135 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #874      +/-   ##
==========================================
- Coverage   77.82%   77.10%   -0.73%     
==========================================
  Files          87       86       -1     
  Lines       12338    12014     -324     
  Branches     1624     1570      -54     
==========================================
- Hits         9602     9263     -339     
- Misses       2434     2459      +25     
+ Partials      302      292      -10

ajdapretnar · 2022-10-07T08:03:43Z

I agree with

> In place of a language dropdown, we can add a checkbox "Use default stop words" or something similar

ajdapretnar · 2022-10-07T11:11:47Z

> I think it is not an error caused by this PR.

You are right, I cannot reproduce this anymore. Ignore the comment.

VesnaT

There is some strange behaviour regarding the corpus.language property and the Corpus ContextHandler:

saved setting probably should not overwrite the input corpus.language (widget Corpus (1))
the corpus.language has been changed globally (widget Python Script (6))
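
The second symptom is what one would see if a widget assigned to the `language` attribute of the corpus object it received rather than to a copy. Whether that is the cause here is an assumption; the sketch below only illustrates the difference, using a hypothetical stand-in `Corpus` class.

```python
import copy


class Corpus:
    """Hypothetical stand-in for a corpus with a language attribute."""

    def __init__(self, documents, language=None):
        self.documents = list(documents)
        self.language = language


def set_language_in_place(corpus, language):
    # Mutates the shared object: every holder of `corpus` sees the change.
    corpus.language = language
    return corpus


def set_language_on_copy(corpus, language):
    # Works on a shallow copy, so the input corpus keeps its language.
    result = copy.copy(corpus)
    result.language = language
    return result
```

Sending the result of `set_language_on_copy` downstream leaves the upstream corpus untouched, so other branches of the workflow would still see the original language.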

VesnaT · 2022-10-28T09:05:30Z

orangecontrib/text/preprocess/filter.py

@@ -75,13 +76,28 @@ class StopwordsFilter(BaseTokenFilter, FileWordListMixin):
    """ Remove tokens present in NLTK's language specific lists or a file. """
    name = 'Stopwords'

+    # nltk uses different language nams for some languages


nams -> names

VesnaT · 2022-10-28T09:13:27Z

orangecontrib/text/datasets/20newsgroups-test.tab.metadata

@@ -0,0 +1 @@
+language: en


A newline is missing at the end of the file. The same goes for all *.tab.metadata files.

VesnaT · 2022-10-28T09:18:23Z

orangecontrib/text/corpus.py

@@ -478,7 +485,7 @@ def copy(self):

    @staticmethod
    def from_documents(documents, name, attributes=None, class_vars=None, metas=None,
-                       title_indices=None):
+                       title_indices=None, language=None):


Language is missing in the docstring.

VesnaT · 2022-10-28T11:06:43Z

orangecontrib/text/widgets/owwikipedia.py

@@ -61,7 +61,7 @@ def __init__(self, *args, **kwargs):

        # Language
        row += 1
-        language_edit = ComboBox(self, 'language', tuple(sorted(lang2code.items())))
+        language_edit = ComboBox(self, 'language', tuple(sorted(LANG2ISO.items())))


Placing the Wikipedia widget on canvas results in an error:

Traceback (most recent call last):
  File "/Users/vesna/orange-canvas-core/orangecanvas/scheme/widgetmanager.py", line 236, in __add_widget_for_node
    w = self.create_widget_for_node(node)
  File "/Users/vesna/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 300, in create_widget_for_node
    widget = self.create_widget_instance(node)
  File "/Users/vesna/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 413, in create_widget_instance
    widget.__init__()
  File "/Users/vesna/orange3-text/orangecontrib/text/widgets/owwikipedia.py", line 64, in __init__
    language_edit = ComboBox(self, 'language', tuple(sorted(LANG2ISO.items())))
TypeError: '<' not supported between instances of 'NoneType' and 'str'
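
The error is consistent with the mapping containing a None among its keys, since `sorted` then has to compare None with a string. A None-safe sort is sketched below; the contents of `LANG2ISO` shown here are hypothetical, and `sorted_language_items` is not a function from the add-on.

```python
# Hypothetical mapping; the None key stands in for whatever entry
# produced the NoneType in the traceback.
LANG2ISO = {"Slovenian": "sl", "English": "en", None: None}


def sorted_language_items(lang2iso):
    """Sort (name, code) pairs by name, placing a None name first."""
    return tuple(
        sorted(
            lang2iso.items(),
            # The boolean orders None before real names; `or ""` keeps the
            # second tuple element a string so comparisons never involve None.
            key=lambda item: (item[0] is not None, item[0] or ""),
        )
    )
```

Alternatively, the None entry could be filtered out before sorting if the combo box should list only named languages.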

djukicn marked this pull request as draft June 30, 2022 08:16

djukicn force-pushed the langdetect branch from 6f8ab87 to 4c0765a Compare June 30, 2022 08:19

PrimozGodec mentioned this pull request Jun 30, 2022

Language detection and smart defaults #701

Closed

6 tasks

PrimozGodec changed the title ~~New language detection~~ [ENH] New language detection Jun 30, 2022

PrimozGodec force-pushed the langdetect branch 2 times, most recently from 4b93fb2 to 3c3de95 Compare July 20, 2022 15:00

PrimozGodec force-pushed the langdetect branch 4 times, most recently from 718562b to 610a756 Compare July 26, 2022 10:59

PrimozGodec force-pushed the langdetect branch 19 times, most recently from 0315e11 to 534b2cc Compare August 19, 2022 14:15

PrimozGodec force-pushed the langdetect branch from a2ce978 to e061cd7 Compare October 6, 2022 15:00

PrimozGodec force-pushed the langdetect branch 3 times, most recently from a370052 to 092558b Compare October 21, 2022 14:38

janezd assigned VesnaT Oct 28, 2022

VesnaT reviewed Oct 28, 2022

PrimozGodec marked this pull request as draft November 4, 2022 07:24

PrimozGodec mentioned this pull request Nov 4, 2022

[ENH] Add language to corpus #916

Merged

3 tasks

PrimozGodec force-pushed the langdetect branch from 092558b to c11060b Compare January 11, 2023 07:25

This was referenced Jan 11, 2023

[ENH] Create Corpus - add language to corpus #924

Merged

[ENH] Guardian - infer language and add to corpus #925

Merged

[ENH] NYTimes - add language to corpus #926

Merged

[ENH] PubMed - add language to corpus #927

Merged

PrimozGodec unassigned VesnaT Jan 11, 2023

PrimozGodec mentioned this pull request Jan 17, 2023

[ENH] Wikipedia - add language to corpus #928

Merged

3 tasks

PrimozGodec force-pushed the langdetect branch 2 times, most recently from 829d7eb to 951a90a Compare March 10, 2023 10:41

PrimozGodec mentioned this pull request Mar 10, 2023

[ENH] Document embedding - Use language from the corpus #953

Merged

3 tasks

PrimozGodec force-pushed the langdetect branch from 951a90a to a3fb166 Compare March 10, 2023 10:51

PrimozGodec mentioned this pull request Mar 10, 2023

[ENH] Sentiment Analysis - Language from corpus #954

Merged

3 tasks

PrimozGodec force-pushed the langdetect branch 3 times, most recently from 54b2ca6 to cb929af Compare March 29, 2023 10:39

PrimozGodec added 3 commits April 11, 2023 09:46

Sentiment Analysis - language from corpus

d228bba

Keywords - language from corpus

7b53f1c

Preprocess text - Use Corpus language

501c135

PrimozGodec force-pushed the langdetect branch from cb929af to 501c135 Compare April 11, 2023 07:47

PrimozGodec closed this Aug 29, 2024
