get_historical_features fails with dask error for file offline store #2865

Closed
elshize opened this issue Jun 27, 2022 · 6 comments · Fixed by #2965

elshize commented Jun 27, 2022

Expected Behavior

feature_store.get_historical_features(df, features=fs_columns).to_df()

where feature_store is a feature store using the file offline store, fs_columns is a list of column names, and df is a pandas DataFrame, should work.

Current Behavior

It currently raises an error inside of dask:

E           NotImplementedError: dd.DataFrame.apply only supports axis=1
E             Try: df.apply(func, axis=1)

Stacktrace:

../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/offline_store.py:81: in to_df
    features_df = self._to_df_internal()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:280: in wrapper
    raise exc.with_traceback(traceback)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:269: in wrapper
    return func(*args, **kwargs)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:75: in _to_df_internal
    df = self.evaluation_function().compute()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:231: in evaluate_historical_retrieval
    df_to_join = _normalize_timestamp(
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:530: in _normalize_timestamp
    df_to_join[timestamp_field] = df_to_join[timestamp_field].apply(

Steps to reproduce

Here is my feature store definition:

import datetime

import numpy as np
import pandas as pd
from feast import FeatureStore, RepoConfig, FileSource, FeatureView, ValueType, Entity, Feature
from feast.infra.offline_stores.file import FileOfflineStoreConfig
from google.protobuf.duration_pb2 import Duration

# tmp_path is presumably pytest's tmp_path fixture (a pathlib.Path)
source_path = tmp_path / "source.parquet"
timestamp = datetime.datetime(year=2022, month=4, day=29, tzinfo=datetime.timezone.utc)
df = pd.DataFrame(
    {
        "entity": [0, 1, 2, 3, 4],
        "f1": [1.0, 1.1, 1.2, 1.3, 1.4],
        "f2": ["a", "b", "c", "d", "e"],
        "timestamp": [
            timestamp,
            # this one should not be fetched as it is too far into the past
            timestamp - datetime.timedelta(days=2),
            timestamp,
            timestamp,
            timestamp,
        ],
    }
)
df.to_parquet(source_path)
source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="timestamp",
)
entity = Entity(
    name="entity",
    value_type=ValueType.INT64,
    description="Entity",
)

view = FeatureView(
    name="view",
    entities=["entity"],
    ttl=Duration(seconds=86400 * 1),
    features=[
        Feature(name="f1", dtype=ValueType.FLOAT),
        Feature(name="f2", dtype=ValueType.STRING),
    ],
    online=True,
    batch_source=source,
    tags={},
)

config = RepoConfig(
    registry=str(tmp_path / "registry.db"),
    project="hello",
    provider="local",
    offline_store=FileOfflineStoreConfig(),
)

store = FeatureStore(config=config)
store.apply([entity, view])

expected = pd.DataFrame(
    {
        "event_timestamp": timestamp,
        "entity": [0, 1, 2, 3, 5],
        "someval": [0.0, 0.1, 0.2, 0.3, 0.5],
        "f1": [1.0, np.nan, 1.2, 1.3, np.nan],
        "f2": ["a", np.nan, "c", "d", np.nan],
    }
)

Specifications

  • Version: 0.21.3
  • Platform: Linux
  • Subsystem: Python 3.9

Possible Solution

This works fine in at least version 0.18.1, but I think it fails for any >0.20

It might have something to do with the added Dask requirement; maybe the required version is insufficient? I was using 2022.2 before, but the requirement is now for 2022.1.1. This is just a guess, though.


elshize commented Jun 27, 2022

In fact, the last version that works is 0.18.1


elshize commented Jun 29, 2022

The problem was:

source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="timestamp",
)

When both timestamp columns are the same, it breaks. Once I changed to:

source = FileSource(
    path=str(source_path),
    timestamp_field="timestamp",
)

it's no longer an issue.
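
One plausible reading of the traceback, not confirmed in this thread: if the same column name ends up in the selected column list twice, then indexing by that name returns a DataFrame instead of a Series, and a DataFrame-level apply is what raises the axis=1 error in dask. The sketch below shows the analogous behavior in plain pandas only. The way the column list is built here is an assumption for illustration and is not taken from the Feast source.

```python
import pandas as pd

# Toy stand-in for the parquet source from the report.
source_df = pd.DataFrame(
    {
        "entity": [0, 1],
        "timestamp": pd.to_datetime(["2022-04-29", "2022-04-27"], utc=True),
    }
)

# Both settings point at the same column, as in the failing FileSource.
timestamp_field = "timestamp"
created_timestamp_column = "timestamp"

# Assumption: the columns to read are collected without de-duplication.
columns = ["entity", timestamp_field, created_timestamp_column]
projected = source_df[columns]  # now has two columns labelled "timestamp"

# With a duplicated label, selecting by name yields a DataFrame, not a Series.
selected = projected[timestamp_field]
print(type(selected).__name__)  # DataFrame
print(selected.shape)  # (2, 2)

# With a unique label, the same selection is a Series.
print(type(source_df[timestamp_field]).__name__)  # Series
```

If this is what happens inside the offline store, using timestamp_field alone avoids the duplicated label, which would be consistent with the workaround above.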

I will leave this ticket open, and let the maintainers decide if this is expected behavior or if there's something to be done to fix it or add some explicit asserts.
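
As a rough idea of what an explicit check could look like, here is a minimal sketch of a guard that rejects a source where both timestamp settings name the same column. The function check_timestamp_columns is hypothetical and not part of Feast; only the parameter names timestamp_field and created_timestamp_column come from this thread.

```python
from typing import Optional


def check_timestamp_columns(
    timestamp_field: str,
    created_timestamp_column: Optional[str] = None,
) -> None:
    """Hypothetical guard (not a Feast API): fail early on a reused column."""
    if created_timestamp_column and created_timestamp_column == timestamp_field:
        raise ValueError(
            "timestamp_field and created_timestamp_column both refer to "
            f"{timestamp_field!r}; use distinct columns or omit "
            "created_timestamp_column"
        )


# Distinct columns, or no created timestamp at all, pass silently.
check_timestamp_columns("timestamp", "created")
check_timestamp_columns("timestamp")

# The configuration from the report is rejected with a clear message.
try:
    check_timestamp_columns("timestamp", "timestamp")
except ValueError as err:
    print(f"rejected: {err}")
```

Failing at definition time like this would surface the misconfiguration before any retrieval runs, instead of as an error deep inside dask.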


achals commented Jun 30, 2022

Thanks for the details @elshize - this definitely smells like a bug we need to fix!


achals commented Jul 21, 2022

I was unable to reproduce this issue locally - for posterity this is my setup:

$ feast version
Feast SDK Version: "feast 0.22.1"
$ pip list
Package                  Version
------------------------ ---------
absl-py                  1.2.0
anyio                    3.6.1
appdirs                  1.4.4
attrs                    21.4.0
bowler                   0.9.0
cachetools               5.2.0
certifi                  2022.6.15
charset-normalizer       2.1.0
click                    8.0.1
cloudpickle              2.1.0
colorama                 0.4.5
dask                     2022.1.1
dill                     0.3.5.1
fastapi                  0.79.0
fastavro                 1.5.3
feast                    0.22.1
fissix                   21.11.13
fsspec                   2022.5.0
google-api-core          2.8.2
google-auth              2.9.1
googleapis-common-protos 1.56.4
greenlet                 1.1.2
grpcio                   1.47.0
grpcio-reflection        1.47.0
h11                      0.13.0
httptools                0.4.0
idna                     3.3
Jinja2                   3.1.2
jsonschema               4.7.2
locket                   1.0.0
MarkupSafe               2.1.1
mmh3                     3.0.0
moreorless               0.4.0
mypy                     0.971
mypy-extensions          0.4.3
numpy                    1.23.1
packaging                21.3
pandas                   1.4.3
pandavro                 1.5.2
partd                    1.2.0
pip                      22.0.4
proto-plus               1.20.6
protobuf                 3.20.1
pyarrow                  6.0.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pydantic                 1.9.1
Pygments                 2.12.0
pyparsing                3.0.9
pyrsistent               0.18.1
python-dateutil          2.8.2
python-dotenv            0.20.0
pytz                     2022.1
PyYAML                   6.0
requests                 2.28.1
rsa                      4.9
setuptools               58.1.0
six                      1.16.0
sniffio                  1.2.0
SQLAlchemy               1.4.39
sqlalchemy2-stubs        0.0.2a24
starlette                0.19.1
tabulate                 0.8.10
tenacity                 8.0.1
tensorflow-metadata      1.9.0
toml                     0.10.2
tomli                    2.0.1
toolz                    0.12.0
tqdm                     4.64.0
typeguard                2.13.3
typing_extensions        4.3.0
urllib3                  1.26.10
uvicorn                  0.18.2
uvloop                   0.16.0
volatile                 2.1.0
watchfiles               0.15.0
websockets               10.3
$ python --version
Python 3.9.11

@elshize can you see if this is still an issue for you and reopen this if that's the case?

achals closed this as completed Jul 21, 2022

felixwang9817 commented

Actually, I was able to repro this. The source of the issue was reusing the same column for timestamp_field and created_timestamp_column; #2965 should solve this!


elshize commented Jul 21, 2022

Yes, the problem was reusing the column. I shared that earlier in a comment; sorry if it wasn't entirely clear.
