Fixing the transformer behaviour on sparse features #122

Craigacp · 2021-03-27T01:32:46Z

Description

This PR adds implementations of observeSparse to all the transformers (apart from IDFTransformation, which already had one). As a consequence, to preserve the 4.0 behaviour, the transformation fitting methods gain a new argument which turns the use of observeSparse on or off. It also switches DoubleFieldProcessor over so that it always emits a value when the field parses as a double. Previously it elided zero values, which makes it much harder to implement transformations that should touch every value because the feature is numerical (rather than a categorical encoded as a double).
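To illustrate why the new argument matters, here is a minimal standalone sketch of how counting implicit zeros changes a fitted statistic. It does not use Tribuo's API: the class `ImplicitZeroMeanSketch`, the method `fitMean`, and its parameters are hypothetical names made up for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Tribuo's API: how implicit zeros change a fitted mean. */
public final class ImplicitZeroMeanSketch {
    /**
     * Fits the mean of a single sparse feature.
     *
     * @param observed values explicitly stored for the feature
     * @param numExamples total number of examples in the dataset
     * @param includeImplicitZeros if true, examples lacking the feature count as 0.0
     * @return the fitted mean
     */
    public static double fitMean(double[] observed, int numExamples, boolean includeImplicitZeros) {
        double sum = Arrays.stream(observed).sum();
        // Implicit zeros add nothing to the sum, but they do enlarge the count.
        int count = includeImplicitZeros ? numExamples : observed.length;
        return sum / count;
    }

    public static void main(String[] args) {
        // The feature is stored in 2 of 4 examples.
        double[] observed = {2.0, 4.0};
        System.out.println(fitMean(observed, 4, false)); // prints 3.0
        System.out.println(fitMean(observed, 4, true));  // prints 1.5
    }
}
```

The same stored values give a different fitted mean depending on whether the absent entries are treated as zeros, which is why the old behaviour has to stay selectable.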

One downside of this implementation is that the behaviour of observeSparse can't be changed on a per-feature basis (that would be a tricky breaking API change), which makes it difficult to apply an IDFTransformation to text features while simultaneously transforming features which should ignore the implicit zeros. If this behaviour is required, two TransformationMaps can be applied in sequence to a Dataset. We'll note this in the release notes for 4.1.
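The two-pass workaround can be sketched in plain Java, without Tribuo's classes. Everything here is hypothetical (`TwoPassSketch`, `center`, `demo`, and the feature names), and mean-centring stands in for both transformations to keep the example short. Each pass touches a different set of features and uses its own implicit-zero setting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not Tribuo's API: two transformation passes with different settings. */
public final class TwoPassSketch {
    /** Centres the named features on their fitted mean, modifying only stored values. */
    public static List<Map<String, Double>> center(List<Map<String, Double>> data,
                                                   Set<String> features,
                                                   boolean includeImplicitZeros) {
        // Fit: accumulate the sum and stored-value count for each named feature.
        Map<String, Double> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Double> example : data) {
            for (String f : features) {
                Double v = example.get(f);
                if (v != null) {
                    sums.merge(f, v, Double::sum);
                    counts.merge(f, 1, Integer::sum);
                }
            }
        }
        // Apply: subtract the fitted mean from each stored value.
        List<Map<String, Double>> out = new ArrayList<>();
        for (Map<String, Double> example : data) {
            Map<String, Double> copy = new HashMap<>(example);
            for (String f : features) {
                if (copy.containsKey(f)) {
                    int n = includeImplicitZeros ? data.size() : counts.get(f);
                    copy.put(f, copy.get(f) - sums.get(f) / n);
                }
            }
            out.add(copy);
        }
        return out;
    }

    /** Runs both passes over a tiny four-example dataset. */
    public static List<Map<String, Double>> demo() {
        List<Map<String, Double>> data = List.of(
            Map.of("word=tribuo", 2.0, "height", 1.0),
            Map.of("height", 3.0),
            Map.of("word=tribuo", 4.0),
            Map.of());
        // Pass 1: the text feature is fitted with implicit zeros (mean 6.0 / 4 = 1.5).
        List<Map<String, Double>> pass1 = center(data, Set.of("word=tribuo"), true);
        // Pass 2: the numerical feature is fitted on stored values only (mean 4.0 / 2 = 2.0).
        return center(pass1, Set.of("height"), false);
    }

    public static void main(String[] args) {
        System.out.println(demo().get(0).get("word=tribuo")); // prints 0.5
        System.out.println(demo().get(0).get("height"));      // prints -1.0
    }
}
```

Because the setting is fixed for a whole pass, splitting the features across two passes is the only way in this sketch to mix the two behaviours, mirroring the limitation described above.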

Motivation

The observeSparse change makes the transformers actually have the documented behaviour, which then rippled into adding new overloads for the transformation fitting methods. The DoubleFieldProcessor change is needed because it's tricky to correctly transform numerical data without it.


JackSullivan

This has the changes that we discussed on Friday, but looking over it now, the interactions between the implicit zeroes and densify behaviours aren't explicitly characterized anywhere. I think it would be a good idea to have one place that clearly explains how all four possible settings work, which can be referred to for more detail in the various constructors/methods where the interaction is important.

Craigacp · 2021-03-29T14:36:30Z

Ok, sure. I'll add something to the package-info in org.tribuo.transform.


JackSullivan

Some nitpicky word-choice changes, but otherwise it looks good to me.


Craigacp · 2021-04-02T17:55:17Z

Thanks for the updates.

Craigacp added 5 commits March 3, 2021 20:54

Adding implementations of observeSparse, and refactoring TransformTrainer and Dataset.createTransformers to expose this behaviour.

3043ac7

Exposing the TransformerMap on TransformedModel. Making TransformerMap more immutable as it should be.

8c5cfd6

Switching DoubleFieldProcessor over so it emits zero valued features.

e11b333

Deprecating TransformStatistics.observeSparse as it's unused. Fixing a provenance serialization issue in IDFTransformation.

c84e336

Improving the transformation documentation.

6efaea7

Craigacp requested a review from JackSullivan March 27, 2021 01:32

JackSullivan reviewed Mar 29, 2021

Core/src/main/java/org/tribuo/transform/TransformTrainer.java

JackSullivan reviewed Mar 29, 2021

Core/src/main/java/org/tribuo/transform/transformations/MeanStdDevTransformation.java

Update Core/src/main/java/org/tribuo/transform/TransformTrainer.java

d601c04

Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>

JackSullivan reviewed Mar 29, 2021


Craigacp added 3 commits April 2, 2021 11:48

Improving the javadoc discussing densify and includeImplicitZeroFeatures.

e1d0563

More changes to transform docs.

f2eb7f3

More updates to transform package docs.

3f25a3f

JackSullivan reviewed Apr 2, 2021

Core/src/main/java/org/tribuo/transform/package-info.java

JackSullivan reviewed Apr 2, 2021

Core/src/main/java/org/tribuo/transform/package-info.java

JackSullivan reviewed Apr 2, 2021

Core/src/main/java/org/tribuo/transform/package-info.java

JackSullivan reviewed Apr 2, 2021

Core/src/main/java/org/tribuo/transform/package-info.java

JackSullivan requested changes Apr 2, 2021

Craigacp and others added 4 commits April 2, 2021 13:54

Update Core/src/main/java/org/tribuo/transform/package-info.java

98ebef2

Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>

Update Core/src/main/java/org/tribuo/transform/package-info.java

6fe0ce9

Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>

Update Core/src/main/java/org/tribuo/transform/package-info.java

5e9368b

Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>

Update Core/src/main/java/org/tribuo/transform/package-info.java

58db418

Co-authored-by: Jack Sullivan <jack.t.sullivan@oracle.com>

JackSullivan approved these changes Apr 2, 2021


Craigacp merged commit ba1bd30 into main Apr 2, 2021

Craigacp deleted the transform-sparseness-fixes branch April 2, 2021 18:01
