Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Concatenate: Fix wrong merging of categorical features #4425

Merged
merged 4 commits into from
Feb 28, 2020

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Feb 15, 2020

Issue

Fixes #4406. Also fixes part of #4382.

The problem was that when merging domains, attributes from the first domain are used and other attributes values' that don't appear in the first table are turned into nan's.

This fix applies only when "primary table" is not present. When it is present, #4422 suffices.

Description of changes

When filtering for unique attributes, two (or more) attributes with different values are replaced by a new attribute.

Includes
  • Code changes
  • Tests

@codecov
Copy link

codecov bot commented Feb 15, 2020

Codecov Report

Merging #4425 into master will decrease coverage by 4.31%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4425      +/-   ##
==========================================
- Coverage   87.46%   83.15%   -4.32%     
==========================================
  Files         405      268     -137     
  Lines       74135    53962   -20173     
==========================================
- Hits        64841    44870   -19971     
+ Misses       9294     9092     -202

compute_value=compute_value, name=name,
number_of_decimals=kwargs.pop("number_of_decimals",
self.number_of_decimals),
**kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables creating combinations of self._number_of_decimals, self.adjust_decimals and self._format_str that are forbidden by number_of_decimals setter.

Speaking of self.adjust_decimals. Why do we need that?

Copy link
Contributor Author

@janezd janezd Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This should go through setter. I changed it to:

    def copy(self, compute_value=None, *, name=None, **kwargs):
        number_of_decimals = kwargs.pop("number_of_decimals", None)
        var = super().copy(compute_value=compute_value, name=name, **kwargs)
        if number_of_decimals is not None:
            var.number_of_decimals = number_of_decimals
        else:
            var._number_of_decimals = self._number_of_decimals
            var.adjust_decimals = self.adjust_decimals
            var.format_str = self._format_str
        return var

adjust_decimals is a flag that tells whether the number of decimals was fixed (e.g. through setter) or is it being adjusted to the largest number of decimals (while reading the file). See function val_from_str_add_cont.

super().__init__(name, compute_value, sparse=sparse)
self.values = values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather make self.values a property and check self._value_index - self._value consistency in its setter. It's too easy to obtain inconsistent state with this api.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. But this commit is from #4422, on which this PR was based. #4422 was already merged with this code.

I agree that values should be a property and it should also be a tuple, not a list. But if I remember correctly, I already tried to change this once but ran into problems. At any rate, it's not in this PR, but if you'd like to try it, I support it.


def add_value(self, s):
""" Add a value `s` to the list of values.
"""
if not isinstance(s, str):
raise TypeError("values of DiscreteVariables must be strings")
self._value_index[s] = len(self.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a public method, I'd add a check if s not in self._value_index, before inserting a new value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. I agree, but this is from #4422, which is already merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now implemented this in #4450.

Plus fixing some lint issues for a good measure.

(table.domain for table in tables))
domains = [table.domain for table in tables]
oper = set.union if self.merge_type == OWConcatenate.MergeUnion \
else set.intersection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why passing oper parameter? It can be set inside self.merge_domains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inertia. :)

@janezd
Copy link
Contributor Author

janezd commented Feb 26, 2020

@VesnaT, I added a test that tries different orders of adding signals: 98b0e15#diff-e8327cf214d283cae48e36d934820db9R351.

Can you pull the PR and run the test? If it passes: can you modify it to replicate what you're doing in canvas, so that it will fail?

@VesnaT VesnaT merged commit 0ec94d4 into biolab:master Feb 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concatenate categorical
2 participants