
Dataproc and Standalone Cluster Spark Job launcher #1022

Merged
merged 10 commits into feast-dev:master
Oct 13, 2020

Conversation

khorshuheng
Collaborator

What this PR does / why we need it:
This PR depends on #1021. Two types of Spark Job launchers have been included: Dataproc cluster and standalone cluster. Launchers for YARN, Kubernetes, and Amazon EMR will be implemented at a later stage, in separate PRs.

Users are not expected to use these launchers directly. Instead, the Feast SDK will use these launchers to submit batch feature retrieval jobs. This will be implemented in another PR.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

NONE

@khorshuheng khorshuheng changed the title (WIP) Dataproc and Standalone Cluster Spark Job launcher Dataproc and Standalone Cluster Spark Job launcher Oct 9, 2020
@khorshuheng khorshuheng added the kind/feature New feature or request label Oct 9, 2020
Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Configuration for the retrieval job, in json format. Sample configuration as follows:
Configuration for the retrieval job, in json format.
entity_df (DataFrame):
Optional. If provided, the entity will be used directly and conf["entity"] will be ignored.
Member

What does "the entity will be used directly" mean? It's not clear why entity_df can be optional.

Collaborator Author

It's mainly to support the scenario below:
client.get_historical_features_df(["rating"], entity_df)

where a user passes in a Spark or Pandas dataframe directly as the entity, instead of using FileSource or BigQuerySource.

If we mandate that users always use FileSource or BigQuerySource, then we don't need this additional argument.
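
For context, a rough sketch of the two call styles under discussion; the client calls are commented out because the exact SDK surface is only settled in the follow-up PR, and the column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Entity rows passed directly as a dataframe; conf["entity"] would then be ignored.
entity_df = spark.createDataFrame(
    [("c_1", "2020-10-09 00:00:00")], ["customer", "event_timestamp"]
)
# client.get_historical_features_df(["rating"], entity_df)

# 2. Entity rows defined via a registered FileSource / BigQuerySource instead,
#    in which case no dataframe argument is needed and conf["entity"] is used.
# client.get_historical_features_df(["rating"])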

Member

I don't understand why it's part of this method. Doesn't the entity_df get uploaded prior to calling this method, meaning it should be configured through the conf?

spark = SparkSession.builder.getOrCreate()
parser = argparse.ArgumentParser(description="Retrieval job arguments")
parser.add_argument(
"config_json", type=str, help="Configuration in json string format"
Member

So this is a json blob right?

Collaborator Author

Yes, it will be a JSON string, similar to the approach used in the Spark jobs written by @pyalex. Originally I chose to use an actual config file, but that makes the job launcher much more complicated due to the uploading and retrieval of the configuration file.
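
A minimal sketch of how such a JSON-string argument can be consumed inside the job, following the config_json snippet above; the config keys shown are illustrative assumptions, not the final schema:

import argparse
import json

# Sketch only: parse the JSON blob passed on the command line, mirroring the
# config_json argument in the snippet above.
parser = argparse.ArgumentParser(description="Retrieval job arguments")
parser.add_argument("config_json", type=str, help="Configuration in json string format")
args = parser.parse_args()

conf = json.loads(args.config_json)
entity_conf = conf.get("entity")                  # e.g. {"format": ..., "path": ...}
feature_tables_conf = conf.get("feature_tables", [])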

Collaborator

@pyalex pyalex Oct 9, 2020

In the case of the ingestion job there are several JSON arguments, and I suggest you split your config as well. It's just easier to work with a job that has separate parameters than one big JSON blob.
So in the ingestion job there are:
--feature-table - JSON with project, name, entities, features
--source - a separate source, to be able to run a custom source
I assume in historical retrieval you would also need:
--entity-source

Here's an example of the arguments for the ingestion job:

--source {"kafka":{"bootstrapServers":"10.200.219.238:6668","topic":"supply-aggregator-sauron","classpath":"com.gojek.esb.aggregate.supply.AggregatedSupplyMessage","mapping":{},"timestampColumn":"event_timestamp"}}
--feature-table {"name":"fs","project":"default","entities":[{"name":"customer","type":"STRING"}],"features":[{"name":"feature1","type":"INT32"}]}

Collaborator

I don't think our jobs' arguments will be completely the same. The historical job receives several feature tables, for example. But at least we can try to keep them similar.

Collaborator Author

--source separate source to be able to run custom source

@pyalex How would the custom source apply in the context of historical feature retrieval? Is it something like a user input that overrides the Feature Table source registered on Feast Core?

Collaborator Author

Or do you mean that the source of the feature tables should be its own parameter (--source), rather than a field under --feature-table?

Collaborator

Just to throw in another idea: for feature_table and entity we could also pass the protobuf in JSON format. You'd still have to manually parse it, but on the caller side it would be easier, as we could replace custom serialization code with e.g. json.dumps(MessageToDict(feature_table.to_proto())).
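
A rough sketch of that round trip, assuming the Feast proto classes; the proto import path here is an assumption, and this is not the approach the PR ultimately took:

import json

from google.protobuf.json_format import MessageToDict, ParseDict
# Assumed proto module path; adjust to wherever FeatureTableSpec actually lives.
from feast.core.FeatureTable_pb2 import FeatureTableSpec

# Caller side: turn the SDK object's proto into a JSON string argument,
# e.g. json.dumps(MessageToDict(feature_table.to_proto())).
spec = FeatureTableSpec(name="fs")
payload = json.dumps(MessageToDict(spec))

# Job side: parse the JSON string back into a proto instead of hand-rolled parsing.
parsed = ParseDict(json.loads(payload), FeatureTableSpec())
assert parsed.name == "fs"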

Collaborator

@khorshuheng I just provided an example of the ingestion job args. It may not be directly relevant to the historical job. I'm just saying we can try to converge on an argument format.

Collaborator

@pyalex pyalex Oct 9, 2020

@oavdeev we can move towards a protobuf-like format, but it already can't be 100% compatible. FeatureTable, for example, has entities as strings, whereas most jobs need the type as well. So some massaging would still be required, mostly denormalization.
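
As a rough illustration of that denormalization, with made-up field names rather than the actual protos:

# Hypothetical illustration: the FeatureTable proto lists entity names only, so the job
# config has to be enriched ("denormalized") with entity types before submission.
feature_table = {"name": "fs", "project": "default", "entities": ["customer"]}
entity_types = {"customer": "STRING"}  # e.g. looked up from the registered Entity objects

denormalized = dict(
    feature_table,
    entities=[{"name": e, "type": entity_types[e]} for e in feature_table["entities"]],
)
# -> {"name": "fs", "project": "default",
#     "entities": [{"name": "customer", "type": "STRING"}]}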

Collaborator

I see, didn't realize that. Let's just go with bespoke JSON then, like in the ingestion job.

bucket = client.get_bucket(self.staging_bucket)
blob_path = os.path.join(
self.remote_path,
"temp",
Member

why temp?

self.region = region
self.job_client = dataproc_v1.JobControllerClient(
client_options={
"api_endpoint": "{}-dataproc.googleapis.com:443".format(region)
Member

can we use f-strings please
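
For reference, the same endpoint built with an f-string would look like this:

region = "us-central1"  # example value; in the launcher this would be self.region
client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}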

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
@khorshuheng
Collaborator Author

/test test-end-to-end

options: Dict[str, str] = {}


class FeatureTableSource(NamedTuple):
Member

I don't understand why we have two sources, nor why created_timestamp is needed for the FeatureTableSource

…es for Feast SDK

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
… FileSource and BQSource

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

class FeatureTableDataframe(NamedTuple):
"""
Feature table dataframe with specification.
Member

Can you add a comment here that describes this class? The current one says nearly nothing.

Member

I don't understand why we need this class.

Collaborator Author

This class contains information from both FeatureTable and Source, excluding the location of the feature table. In my original implementation, both the timestamp columns and the feature table specification are at the same level of the configuration. But in the ingestion job implementation, the timestamp columns are in Source, whereas the feature table specification is in FeatureTable.

I can see two choices here:

  1. Eliminate FeatureTableDataframe, and pass both Source and FeatureTable into as_of_join and join_entity_to_feature_tables, even though the two methods don't actually need to know the format and path of the Feature Table source.

  2. Move the timestamp columns to FeatureTable instead of Source, but that would mean the configuration differs from the Ingestion Job.

Member

@woop woop Oct 12, 2020

This class contains information from both FeatureTable and Source, excluding the location of the feature table. In my original implementation, both timestamp columns and feature table specification are within the same level of configuration. But for the implementation in ingestion job, timestamp columns is in Source, whereas feature table specification is in FeatureTable.

Ok, but it's not clear what this class is for. You're talking about your past implementations and implementation details, but what problem are you trying to solve? This class is called FeatureTableDataframe; could it be replaced by using a FeatureTable and a DataFrame?

I can see two choices here:

  1. Eliminate FeatureTableDataframe, and pass both Source and FeatureTable into as_of_join and join_entity_to_feature_tables, eventhough the two methods don't exactly need to know the format and path of the Feature Table source.
  2. Move timestamp columns to FeatureTable instead of Source, but that would mean the configuration is different from the Ingestion Job.

You are presenting solutions, but I don't know what problem you are trying to solve.

All class creation should be justified. So when I see a class like this, I want to know why it's essential complexity. Why can't it be a function?

Collaborator Author

@khorshuheng khorshuheng Oct 12, 2020

The FeatureTable class doesn't contain the timestamp column and created timestamp column. If it did, then yes, I wouldn't need this class and could easily replace it with a FeatureTable and a DataFrame, which is my preferred solution. The only reason I haven't done this is the previous discussion where we agreed to keep the configuration similar to the ingestion job.

The problem I am trying to solve here: I want neither the as_of_join method nor the join_entity_features method to take any argument that contains file format or file path information, which means I can't pass Source as an argument. But the timestamp column information is required. Without creating a new class, the arguments of the as-of-join methods would need to include a list of FeatureTables, a list or dictionary of event timestamp columns, and the dataframes. The new class acts as a container that stores all of these attributes together.

Ok, let me come up with an alternative solution that doesn't involve this new class and still keeps the existing configuration, and see how it goes.
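
A minimal sketch of what such an alternative could look like, with the timestamp columns passed as plain arguments so neither method sees the source format or path; the names are illustrative, not the merged code:

from typing import Optional

from pyspark.sql import DataFrame

# Hypothetical signature: the timestamp columns (which live in the Source config)
# travel next to the FeatureTable spec instead of inside a FeatureTableDataframe,
# so as_of_join never learns the source's format or path.
def as_of_join(
    entity_df: DataFrame,
    feature_table_df: DataFrame,
    feature_table: "FeatureTable",            # spec only: name, entities, features
    event_timestamp_column: str,              # taken from the table's Source config
    created_timestamp_column: Optional[str] = None,
    max_age: Optional[int] = None,            # seconds; lower bound on feature timestamps
) -> DataFrame:
    """Point-in-time join of entity_df to feature_table_df (sketch only)."""
    ...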

Source for an entity or feature dataframe.

Attributes:
timestamp_column (str): Column representing the event timestamp.
Member

can we make this event_timestamp_column?

Collaborator Author

I prefer event_timestamp_column as well, though there are other instances in the Feast SDK where timestamp_column is used instead (Datasource and BatchSource, for example).

I am happy to make the change though. Should I change it only for historical_feature_retrieval_job?

Member

Let's just make the change here, please.

timestamp_column (str): Column representing the event timestamp.
created_timestamp_column (str): Column representing the creation timestamp. Required
only if the source corresponds to a feature table.
mapping (Optional[Dict[str, str]]): If present, the source column will be renamed
Member

should this be field_map?

Collaborator Author

I am fine with making this field_map, but currently Oleksii's ingestion job uses mapping instead. Which one should I use then?

Member

Mapping seems a bit vague. The type is a map, so it's already a mapping. @pyalex

Collaborator Author

Renamed it to field_mapping instead, because that's what's being used in the protobuf and the other parts of the Feast SDK right now.

path: str


class EntityDataframe(NamedTuple):
Member

I don't understand the point of this class.

Args:
spark (SparkSession): Spark session.
entity_source (Source): Entity data source.
feature_tables_sources (Source): List of feature tables data sources.
Member

Typo on source

`max_age` is in seconds, and determines the lower bound of the timestamp of the retrieved feature.
If not specified, this would be unbounded.
Returns:
DataFrame: Join result.
Member

is join result sufficient documentation?

raise NotImplementedError


class StandaloneCluster(JobLauncher):
Member

Should this be StandaloneClusterJobLauncher?

pyspark_script (str): Local file path to the pyspark script for historical feature
retrieval.
entity_source_conf (List[Dict]): Entity data source configuration.
feature_tables_sources_conf (Dict): List of feature tables data sources configurations.
Member

Where have you documented when feature_tables_sources_conf will be used vs. the sources found in feature_tables_conf?

Collaborator Author

@khorshuheng khorshuheng Oct 12, 2020

Sources would not be part of feature_tables_conf, as per the configuration design found here: https://github.com/feast-dev/feast/blob/master/spark/ingestion/src/main/scala/feast/ingestion/IngestionJobConfig.scala#L74-L79

Though I understand that the above behaviour is actually inconsistent with the way we currently define our FeatureTable proto.

Should I revamp the configuration (for the historical retrieval job) so that Source is part of the Feature Table conf? In that case, it would not be necessary to have both feature_tables_conf and feature_tables_sources_conf.

Collaborator

Maybe we can tweak the format in a separate diff? Either way works, but above everything else I'd definitely prefer things to be consistent between this and @pyalex's ingestion job.

`options` is optional. If present, the options will be used when reading / writing the input / output.
Args:
spark (SparkSession): Spark session.
entity_source (Source): Entity data source.
Collaborator

I don't think "entity data source" is a concept that has been used before in the codebase, maybe worth adding an explanation what it is. Or maybe call it entity_df_source? It is somewhat confusing that "entities" is used interchangeably for the entity objects themselves and the dataframe for point-in-time joins.

"--master",
self.master_url,
"--name",
job_id,
Collaborator

Why is job_id externally configurable here? (as opposed to generating one inside JobLauncher)

Collaborator Author

The Feast client will be the one generating the job_id, rather than it being user specified. As to why the job id is generated in the Feast client rather than the JobLauncher:

Prior to launching the job, the JobLauncher will do some preparatory tasks, such as uploading a Pandas dataframe to GCS / S3 (if the user input is a Pandas dataframe rather than a URI pointing to a file) and defining the output path based on the Feast client configuration. In both cases, we can use the job id as part of the GCS / S3 path name, which would be useful for tracking purposes.

That being said, I am fine with moving the job id generation into the JobLauncher instead, if there is a good reason why that would be better than generating it in the Feast client.

Collaborator

I see. I don't feel super strongly either way, but you could say that job id generation is part of the JobLauncher's internal logic. That way the cloud-specific implementation would be able to use a specific job id format for convenience. There might be cloud-specific limitations on job id length / charset that the Feast client is not aware of.

It may come in handy given that we don't plan to store jobs in a central database; they will have to be stored as some attribute or tag in the Dataproc/EMR job metadata itself.
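
A minimal sketch of launcher-side id generation; the prefix and the Dataproc length/charset constraint are mentioned purely for illustration and are assumptions, not the merged implementation:

import uuid


def _generate_job_id(prefix: str = "feast-hist-retrieval") -> str:
    # Hypothetical helper: Dataproc job ids are limited to letters, digits, '-' and '_'
    # (max 100 chars), so a short uuid suffix keeps the id unique while staying valid.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


print(_generate_job_id())  # e.g. feast-hist-retrieval-3f2a9c1b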

Member

Is this job_id the internal job_id or the external one? They might be different. The latter should be encapsulated; the former is internal to Feast and would be exposed to users. Or are we going to make them one and the same?

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
@feast-ci-bot
Collaborator

feast-ci-bot commented Oct 13, 2020

@khorshuheng: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
test-end-to-end-batch 9aed5d8a57cb70c957441433fd6e577894ba1818 link /test test-end-to-end-batch

Full PR test history

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

…configuration

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khorshuheng, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
@oavdeev
Collaborator

oavdeev commented Oct 13, 2020

/lgtm

@feast-ci-bot feast-ci-bot merged commit ffaf8c5 into feast-dev:master Oct 13, 2020