FeatureHasher should have an option to not hash the values #309

Merged: Craigacp merged 2 commits into main from feature-hasher-fix on Dec 21, 2022

Craigacp (Member) commented Dec 19, 2022

Description

FeatureHasher now has an option to preserve the values of the features that it hashes, and this option is available through TokenPipeline. Adds a couple of tests for the new behaviour.

This changes the default behaviour of TokenPipeline when hashing is enabled, and adds a constructor option to revert to the old behaviour. The old behaviour didn't make much sense, but the option lets people preserve their existing computations. Note that value preservation is on by default, so reproducing a model trained before this fix will use the new behaviour unless the option is explicitly overridden.

Motivation

A term counting TokenPipeline should never return negative values for the term counts. However, because FeatureHasher hashes the feature values into {-1,1}, the hashed dimensions could aggregate to a negative value, which causes problems elsewhere and is semantically not term counting.

Fixes #307.
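
To illustrate the motivation, here is a minimal, self-contained sketch of the two behaviours. It is not Tribuo's FeatureHasher: the class name `HashingSketch`, the `mix` function, the seed constants, and the `preserveValue` flag are all hypothetical names chosen for this example. It assumes the old behaviour replaces each value with a hashed sign in {-1,1}, as described above, and that the new option keeps the original value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Tribuo's FeatureHasher.
final class HashingSketch {
    // Hypothetical seeds: one for the bucket index, one for the value hash.
    static final int INDEX_HASH_SEED = 38495;
    static final int VALUE_HASH_SEED = 77777;

    // A small, arbitrary integer mixer (an assumption, not the library's hash).
    static int mix(String name, int seed) {
        int h = name.hashCode() ^ seed;
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    // Hashes named feature values into a dense vector of the given dimension.
    static double[] hash(Map<String, Double> features, int dimension, boolean preserveValue) {
        if (dimension < 1) {
            throw new IllegalArgumentException("dimension must be positive, found " + dimension);
        }
        double[] out = new double[dimension];
        for (Map.Entry<String, Double> e : features.entrySet()) {
            int index = Math.floorMod(mix(e.getKey(), INDEX_HASH_SEED), dimension);
            double value = e.getValue();
            if (!preserveValue) {
                // Old-style behaviour in this sketch: the value becomes a hashed sign.
                value = (mix(e.getKey(), VALUE_HASH_SEED) & 1) == 0 ? 1.0 : -1.0;
            }
            // Colliding features are summed, so signed values can go negative.
            out[index] += value;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> termCounts = new LinkedHashMap<>();
        termCounts.put("the", 3.0);
        termCounts.put("cat", 1.0);
        termCounts.put("sat", 2.0);
        termCounts.put("mat", 1.0);

        System.out.println("preserved: " + java.util.Arrays.toString(hash(termCounts, 4, true)));
        System.out.println("signed:    " + java.util.Arrays.toString(hash(termCounts, 4, false)));
    }
}
```

With value preservation, non-negative term counts can only add up within a bucket, so every hashed dimension stays non-negative and the total count is unchanged. With signed values, a bucket's total depends on the signs of whichever terms collide in it, which is how negative "counts" can appear.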

oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Dec 19, 2022
jhalexand (Member) previously approved these changes on Dec 21, 2022, and left a comment:

Looks good, with one minor nit: the documentation (accurately) describes the value hash seed using the phrasing "value hash", but the member variables and static final members are named using "hash value". I think the former is clearer, as "hash value" can easily be read as a more generic term. I'm still approving since this is a very minor nit, but I won't merge in case you want to change this first.

Craigacp added the "squash-commits" label (Squash the commits when merging this PR) on Dec 21, 2022
jhalexand (Member) left a comment:

Looks good

Craigacp merged commit 4d8b58f into main on Dec 21, 2022
Craigacp deleted the feature-hasher-fix branch on December 21, 2022 16:32
Craigacp added a commit that referenced this pull request Dec 21, 2022
* FeatureHasher should have an option to not hash the values.

* Renaming DEFAULT_HASH_VALUE_SEED to DEFAULT_VALUE_HASH_SEED, and adding validation for dimension.
Labels
* OCA Verified: All contributors have signed the Oracle Contributor Agreement.
* squash-commits: Squash the commits when merging this PR
Development

Successfully merging this pull request may close these issues.

Negative values for term counts when using feature hashing
2 participants