Streaming Ingestion Pipeline with Spark #1027

Merged
merged 9 commits into from
Oct 13, 2020

pyalex
Collaborator

@pyalex pyalex commented Oct 7, 2020

What this PR does / why we need it:

This PR replaces the current (Beam) ingestion job with a Spark version. It also introduces several changes:

  1. Input to the streaming pipeline is an arbitrary* protobuf message. The message class must be provided through the FeatureTable configuration.
  2. The ingestion job stores only to Redis and writes dead letters as Parquet files.
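
To illustrate the second change, here is a minimal sketch of how a batch of parsed rows might be split between the online store and the dead-letter sink. It is not the job's actual code: the `Row` alias, `is_valid`, `split_rows`, and the field names `entity_id` and `event_timestamp` are all hypothetical, and the validation rule is only an assumption about what makes a row storable.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for a parsed protobuf message flattened into a row.
Row = Dict[str, Any]


def is_valid(row: Row) -> bool:
    """Assumed rule: a row needs a non-null entity key and event timestamp."""
    return row.get("entity_id") is not None and row.get("event_timestamp") is not None


def split_rows(
    rows: List[Row], check: Callable[[Row], bool] = is_valid
) -> Tuple[List[Row], List[Row]]:
    """Return (rows destined for the online store, rows destined for dead letters)."""
    to_store: List[Row] = []
    dead_letters: List[Row] = []
    for row in rows:
        (to_store if check(row) else dead_letters).append(row)
    return to_store, dead_letters


batch = [
    {"entity_id": 1, "event_timestamp": 1602547200, "trips": 5},
    {"entity_id": None, "event_timestamp": 1602547260, "trips": 7},  # null key
    {"entity_id": 2, "trips": 3},  # missing timestamp
]
to_store, dead_letters = split_rows(batch)
print(len(to_store), len(dead_letters))  # prints: 1 2
```

In the real job the first list would go to Redis and the second would be written out as Parquet; the sketch only shows the routing decision.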

Currently there are two ways to make the protobuf class available to the job at runtime:

  1. Compile the proto, pack it into a jar, and add the jar to the list of files passed to spark-submit.
    This option has a limitation: the compiled class must be linked against the shaded protobuf library com.google.protobuf.vendor,
    since we had to shade protobuf in our job due to conflicts with Spark dependencies.
  2. Use a proto registry (an external service) as the source of proto descriptors. As a start, we are going to support https://github.com/gojekfarm/stencil, but we will need to support a more common proto registry implementation.

*Arbitrary protobuf has some limitations: oneof and google.protobuf.Any are not supported. Thus FeatureRow can't be the input to the new streaming pipeline.
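
As a rough illustration of these limitations, the sketch below flags fields that the pipeline could not handle. It works on a simplified, hypothetical schema model (`FieldSpec`) rather than real protobuf descriptors, and `unsupported_fields` and the `DriverEvent`-style example fields are invented for this example, not part of the job.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldSpec:
    """Hypothetical, simplified view of one field in a message schema."""
    name: str
    type_name: str                     # scalar name or fully qualified message name
    oneof_group: Optional[str] = None  # name of the enclosing oneof, if any


def unsupported_fields(fields: List[FieldSpec]) -> List[str]:
    """List a reason for every field that uses oneof or google.protobuf.Any."""
    problems: List[str] = []
    for f in fields:
        if f.oneof_group is not None:
            problems.append(f"{f.name}: member of oneof '{f.oneof_group}'")
        if f.type_name == "google.protobuf.Any":
            problems.append(f"{f.name}: google.protobuf.Any is not supported")
    return problems


# Invented example schema, for illustration only.
driver_event = [
    FieldSpec("driver_id", "int64"),
    FieldSpec("event_timestamp", "google.protobuf.Timestamp"),
    FieldSpec("payload", "google.protobuf.Any"),
    FieldSpec("cash", "double", oneof_group="payment"),
    FieldSpec("card", "string", oneof_group="payment"),
]
for problem in unsupported_fields(driver_event):
    print(problem)
```

A message whose check returns an empty list uses neither construct; any message with at least one reported field would fall outside what the streaming pipeline accepts.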

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

The streaming ingestion job expects an arbitrary protobuf as input, with some limitations: `oneof` is not supported, since Spark's DataFrame doesn't support it.

Signed-off-by: Oleksii Moskalenko <moskalenko.alexey@gmail.com>
@pyalex pyalex changed the title WIP Streaming Ingestion Pipeline with Spark Streaming Ingestion Pipeline with Spark Oct 12, 2020
@pyalex pyalex added the kind/feature New feature or request label Oct 12, 2020
@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oavdeev, pyalex


@woop
Member

woop commented Oct 13, 2020

/lgtm

@feast-ci-bot feast-ci-bot merged commit 2cd019c into feast-dev:master Oct 13, 2020