Fixing the transformer behaviour on sparse features #122

Merged: Craigacp merged 13 commits into main from transform-sparseness-fixes on Apr 2, 2021

Craigacp (Member) commented Mar 27, 2021

Description

This PR adds implementations of observeSparse to all the transformers (apart from IDFTransformation, which already had one). As a consequence, to preserve the 4.0 behaviour, the transformation fitting methods gain a new argument which turns the use of observeSparse on or off. It also changes DoubleFieldProcessor so that it always emits a value when the field parses as a double. Previously it elided zero values, which made it much harder to implement transformations that should touch every value because the feature is numerical (rather than a categorical encoded as a double).
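To make the effect of observing implicit zeros concrete, here is a minimal, self-contained sketch. `MeanStatisticSketch` and `fitMean` are hypothetical names invented for illustration and are not the classes touched by this PR; only the `observeSparse` idea comes from the description above. A `null` entry stands for an example where the feature was not stored.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a transformer's fitting statistics; not Tribuo's actual class. */
final class MeanStatisticSketch {
    private double sum = 0.0;
    private long count = 0;

    /** Records an explicitly stored feature value. */
    void observeValue(double value) {
        sum += value;
        count++;
    }

    /** Records one example where the feature was absent, i.e. an implicit zero. */
    void observeSparse() {
        count++;
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    /**
     * Fits the mean of a single feature column.
     * The flag plays the role of the new fitting argument: when false,
     * absent values are skipped entirely.
     */
    static double fitMean(List<Double> column, boolean includeImplicitZeros) {
        MeanStatisticSketch stats = new MeanStatisticSketch();
        for (Double v : column) {
            if (v != null) {
                stats.observeValue(v);
            } else if (includeImplicitZeros) {
                stats.observeSparse();
            }
        }
        return stats.mean();
    }

    public static void main(String[] args) {
        // The feature is stored in two of four examples.
        List<Double> column = Arrays.asList(2.0, null, 4.0, null);
        System.out.println(fitMean(column, false)); // prints 3.0 (stored values only)
        System.out.println(fitMean(column, true));  // prints 1.5 (implicit zeros counted)
    }
}
```

The same column yields a different fitted mean depending on the flag, which is presumably why the old behaviour has to remain selectable.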

One downside of this implementation is that the behaviour of observeSparse can't be changed on a per-feature basis (that would be a tricky breaking API change). This makes it difficult to apply an IDFTransformation to text features while simultaneously transforming features which should ignore the implicit zeros. If this behaviour is required, two TransformationMaps can be applied in sequence to a Dataset. We'll note this in the release notes for 4.1.
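The two-pass workaround can be sketched roughly as follows. This is plain Java over maps rather than Tribuo code: `TwoPassSketch`, `fitMeans`, `scaleByMean` and `twoPass` are hypothetical helpers, and scaling by the mean is just an arbitrary stand-in transformation. The first pass fits one group of features with implicit zeros counted, the second pass fits a different group on stored values only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of applying two differently configured passes in sequence. */
final class TwoPassSketch {

    /** Fits per-feature means; absent values count as zeros only when includeImplicitZeros is true. */
    static Map<String, Double> fitMeans(List<Map<String, Double>> data,
                                        Set<String> features,
                                        boolean includeImplicitZeros) {
        Map<String, Double> means = new HashMap<>();
        for (String f : features) {
            double sum = 0.0;
            int n = 0;
            for (Map<String, Double> example : data) {
                Double v = example.get(f);
                if (v != null) {
                    sum += v;
                    n++;
                } else if (includeImplicitZeros) {
                    n++;
                }
            }
            means.put(f, n == 0 ? 0.0 : sum / n);
        }
        return means;
    }

    /** Returns a copy of the data with each stored value of a fitted feature divided by its mean. */
    static List<Map<String, Double>> scaleByMean(List<Map<String, Double>> data,
                                                 Map<String, Double> means) {
        List<Map<String, Double>> out = new ArrayList<>();
        for (Map<String, Double> example : data) {
            Map<String, Double> copy = new HashMap<>(example);
            for (Map.Entry<String, Double> e : means.entrySet()) {
                if (copy.containsKey(e.getKey()) && e.getValue() != 0.0) {
                    copy.put(e.getKey(), copy.get(e.getKey()) / e.getValue());
                }
            }
            out.add(copy);
        }
        return out;
    }

    /** Pass 1 counts implicit zeros for one feature group; pass 2 ignores them for another. */
    static List<Map<String, Double>> twoPass(List<Map<String, Double>> data,
                                             Set<String> sparseAware,
                                             Set<String> storedOnly) {
        List<Map<String, Double>> first = scaleByMean(data, fitMeans(data, sparseAware, true));
        return scaleByMean(first, fitMeans(first, storedOnly, false));
    }

    public static void main(String[] args) {
        List<Map<String, Double>> data = List.of(
            Map.of("word", 3.0, "height", 6.0),
            Map.of("height", 2.0)); // "word" is absent here
        System.out.println(twoPass(data, Set.of("word"), Set.of("height")));
    }
}
```

Because the two groups are fitted in separate passes, each gets its own sparse setting, at the cost of transforming the dataset twice.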

Motivation

The observeSparse change makes the transformers actually behave as documented, which then rippled into adding new overloads for the transformation fitting methods. The DoubleFieldProcessor change is needed because it's tricky to transform numerical data correctly without it.

@Craigacp Craigacp requested a review from JackSullivan March 27, 2021 01:32
Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>
JackSullivan (Member) left a comment

This has the changes that we discussed on Friday, but looking over it now, the interactions between the implicit zeros and densify behaviours aren't explicitly characterized anywhere. I think it would be a good idea to have one place that clearly explains how all four possible settings work, which can be referred to for more detail from all the various constructors/methods where the interaction is important.
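One possible reading of the four settings is a 2×2 grid: whether implicit zeros are counted when fitting, and whether absent values are densified (materialized as explicit zeros and transformed) when applying. The sketch below enumerates that grid for mean-centering of a single column. It rests on that assumption and is not taken from the PR's code; `FourSettingsSketch`, `fitMean` and `center` are hypothetical names, and `null` marks an absent value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical 2x2 enumeration of fit-time and apply-time sparse handling. */
final class FourSettingsSketch {

    /** Fits the column mean, counting absent entries as zeros only when asked to. */
    static double fitMean(List<Double> column, boolean includeImplicitZeros) {
        double sum = 0.0;
        int n = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                n++;
            } else if (includeImplicitZeros) {
                n++;
            }
        }
        return n == 0 ? 0.0 : sum / n;
    }

    /** Subtracts the mean; absent entries are transformed only when densify is true. */
    static List<Double> center(List<Double> column, double mean, boolean densify) {
        List<Double> out = new ArrayList<>();
        for (Double v : column) {
            if (v != null) {
                out.add(v - mean);
            } else if (densify) {
                out.add(0.0 - mean); // the implicit zero becomes an explicit, transformed value
            } else {
                out.add(null);       // stays absent, so it is still read as zero downstream
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> column = Arrays.asList(2.0, null, 4.0, null);
        for (boolean zeros : new boolean[] {false, true}) {
            for (boolean densify : new boolean[] {false, true}) {
                double mean = fitMean(column, zeros);
                System.out.println("implicitZeros=" + zeros + " densify=" + densify
                    + " -> " + center(column, mean, densify));
            }
        }
    }
}
```

Only the combination with both flags enabled produces a column whose mean is exactly zero across all four examples, which suggests why the interaction deserves a single documented explanation.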

Craigacp (Member, Author) commented

Ok, sure. I'll add something to the package-info in org.tribuo.transform.

JackSullivan (Member) left a comment

Some nitpicky word-choice changes, but otherwise it looks good to me.

Craigacp and others added 4 commits April 2, 2021 13:54
Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>
Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>
Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>
Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>
Craigacp (Member, Author) commented Apr 2, 2021

Thanks for the updates.

@Craigacp Craigacp merged commit ba1bd30 into main Apr 2, 2021
@Craigacp Craigacp deleted the transform-sparseness-fixes branch April 2, 2021 18:01
2 participants