Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Widen dependencies #2906

Closed
chhabrakadabra opened this issue Jul 4, 2022 · 7 comments · Fixed by #2928
Closed

Widen dependencies #2906

chhabrakadabra opened this issue Jul 4, 2022 · 7 comments · Fixed by #2928
Labels
kind/feature New feature or request priority/p1

Comments

@chhabrakadabra
Copy link
Collaborator

Is your feature request related to a problem? Please describe.

Feast often needs to be installed alongside other libraries. Currently Feast pins many of its dependencies to the minor version instead of using a broader range. This means that if other libraries also pin to narrow version ranges, dependencies just can't be resolved.

Describe the solution you'd like

Change setup.py to use wider ranges for various dependencies. I'm not sure how that changes your testing strategy. Do you have to do some amount of matrix testing for important dependencies?

Describe alternatives you've considered

Not sure.

Additional context

N/A

@chhabrakadabra chhabrakadabra added the kind/feature New feature or request label Jul 4, 2022
@felixwang9817
Copy link
Collaborator

@chhabrakadabra thanks for raising this! I think we should probably just loosen all == dependencies to allow for all minor versions up to the next major version unless there is some explicit reason not to, as we did in #2647. For example, we would change pygments==2.12.0 to pygments>=2.12.0,<3.

@kevjumba
Copy link
Collaborator

kevjumba commented Jul 5, 2022

@chhabrakadabra if you would like to open a pr to pin the dependencies that would b egreat!

@chhabrakadabra
Copy link
Collaborator Author

I tried to widen the dependencies, but I think I might need help from the feast maintainers. Specifically, upgrading the bigquery library was not trivial. Various integration tests fail. Here are my notes analyzing just one of them (sdk/python/tests/integration/offline_store/test_offline_write.py::test_reorder_columns[GCP:BigQuery:redis:python_fs:False:go_fs:False]):

  • The reorder columns test (and I imagine other tests too) are failing because of some types related changes introduced in google-cloud-bigquery 3.0. this and this
    • Notably, the types roundtrip is failing. This is what I think happens in the test:
      • A bunch of Pandas dataframes are created for the test itself. This includes the driver_hourly_stats. Notably, the event_timestamp is timezone-aware and the created column is not.
      • In previous versions of google-cloud-bigquery, both of them when written to BigQuery would have resulted in the type TIMESTAMP. But now the timezone unaware type results in DATETIME instead.
      • In BigQueryOfflineStore.offline_write_batch, we create a bigquery job-config that includes our expected schema. We calculate this schema as follows:
        • We start with the schema from the batch source. We fetch the table schema from the bigquery client itself. This should contain datetime and timestamp where appropriate.
        • We then convert this schema into feast ValueTypes. This is the common schema that feast uses to arbitrate between the types of various systems it interfaces with. This is done using the bq_to_feast_value_type function. Notably, both DATETIME and TIMESTAMP get converted to ValueType.UNIX_TIMESTAMP. This is a problem. The common value types in Feast don't have separate types for datetime and timestamp.
        • We then convert this ValueType based schema into a [[PyArrow]] schema using a feast_value_type_to_pa function. The UNIX_TIMESTAMP fields both become timezone-unaware timestamps.
        • We then use this PyArrow schema to assert against the incoming data. If the column names are the same, we might even attempt to cast the existing data into the expected pyarrow schema.
        • We then convert the pyarrow schema into a BigQuery schema using a arrow_schema_to_bq_schema function in feast/infra/offline_stores/bigquery.py. This uses a ARROW_SCALAR_IDS_TO_BQ dict from the bigquery library itself, which converts both of those columns (that are now timezone unaware timestamps) to TIMESTAMP. This is obviously not the schema of the bigquery table, which breaks and complains.
    • I think the fix might be to introduce a datetime type into the ValueTypes system.

To recap, a solution I can see to this problem is to add another type to the type system to catch the differences between timezone aware and timezone unaware timestamps. Alternatively, we can make all incoming timezone-unaware timestamps timezone-aware by assuming UTC. I'm not 100% sure what the maintainers prefer.

@kevjumba
Copy link
Collaborator

@chhabrakadabra Hey Abhin, one question, I see that you're pinning a bunch of versions to allow a major version upgrade. Is this necessary? E.g currently bigquery version is 2 and it would be easier if we pin it below 3?

Do you guys have specific needs for the next major version

@adchia
Copy link
Collaborator

adchia commented Jul 14, 2022

We had a quick discussion. Seems like the preference is to throw errors if we see timezone naive timestamps and only use the TIMESTAMP

We can make this nicer in the future too with data source validation during feast apply

@chhabrakadabra
Copy link
Collaborator Author

@chhabrakadabra Hey Abhin, one question, I see that you're pinning a bunch of versions to allow a major version upgrade. Is this necessary? E.g currently bigquery version is 2 and it would be easier if we pin it below 3?

Do you guys have specific needs for the next major version

Yeah, I think this is a sore spot for integrating with other libraries. The core idea behind this ticket is that Feast is being installed alongside many other libraries by data scientists that want to fetch data from the feature store. Many of these libraries have moved on to bigquery version 3 and even set a lower bound of 3.0.0. So Feast not upgrading the bigquery client library leads to people having to use older versions of these libraries.

@kevjumba
Copy link
Collaborator

@chhabrakadabra I see thanks for clarifying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request priority/p1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants