
Add Policy Recommendation Spark job and image #16

Merged · 2 commits · Jun 3, 2022

Conversation

@dreamtalen (Contributor) commented on May 5, 2022:

In this PR, we add the Policy Recommendation Spark job and image to the Theia repo.
The policy recommendation Spark job previously received several rounds of review on the Antrea repo in antrea-io/antrea#3064. The main changes compared with that closed PR are:

  1. Add a feature that removes auto-generated Pod labels so that recommended policies can be merged.
  2. Add the capability to recommend toService ANPs for Pod-to-Service flows.
  3. Make the corresponding changes to work with the Antctl CLI on the Flow Aggregator side.

This is only the first PR for the Policy Recommendation feature. I will create subsequent PRs, including documentation, the Antctl CLI, unit tests, and e2e tests, to complete this new feature.

@salv-orlando (Contributor) left a comment:
Partial review

@@ -0,0 +1,19 @@
FROM gcr.io/spark-operator/spark-py:v3.1.1
Contributor:
TODO for a future PR: consider whether this image can be moved to the antrea Docker Hub, as we did for the other third-party images used by Theia.

@dreamtalen (Contributor, Author) replied on May 22, 2022:

Got it, yes, we can tag this image and push it to our Docker Hub too.

"""Returns the model properties as a dict"""
result = {}

for attr, _ in six.iteritems(self.attribute_types):
Contributor:
For instance, here you can just use .items() because you don't have to worry about Python 2 compatibility.
If the image defaults to Python 2, you can change the /usr/bin/python symlink or explicitly launch the job with python3.

@dreamtalen (Contributor, Author) replied:
Thanks, changed.
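A minimal sketch of the suggested change, assuming the generated-model to_dict() style shown in the quoted snippet (the attribute_types mapping of attribute names to types is taken from that context):

def to_dict(self):
    """Returns the model properties as a dict"""
    result = {}
    # dict.items() is enough on Python 3; six.iteritems() was only needed
    # for Python 2 compatibility.
    for attr, _ in self.attribute_types.items():
        result[attr] = getattr(self, attr)
    return result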

print(help_message)
sys.exit(2)
for opt, arg in opts:
if opt in ("-h", "--help"):
Contributor:
I'd check for -h before entering the loop; a user may also specify other options, and we don't want to parse them if -h is specified.

@dreamtalen (Contributor, Author) replied:
Sounds good to me, thanks!
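A minimal sketch of handling --help before the option loop, assuming the getopt-style parsing visible in the quoted snippet; help_message is a placeholder here, and the -s/-e flags mirror the job's help text quoted further down:

import getopt
import sys

help_message = "usage: policy_recommendation_job.py [options]"  # placeholder text
start_time = None
end_time = None
opts, _ = getopt.getopt(sys.argv[1:], "hs:e:", ["help", "start_time=", "end_time="])
# Handle -h/--help first so no other options are parsed when help is requested.
if any(opt in ("-h", "--help") for opt, _ in opts):
    print(help_message)
    sys.exit(0)
for opt, arg in opts:
    if opt in ("-s", "--start_time"):
        start_time = arg
    elif opt in ("-e", "--end_time"):
        end_time = arg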

@dreamtalen force-pushed the policy-reco branch 2 times, most recently from d5b1586 to 5670828 on May 22, 2022 22:41.
# Select user trusted denied flows when unprotected equals False
sql_query += " WHERE trusted == 1"
if start_time:
    sql_query += " AND flowEndSeconds >= '{}'".format(start_time)
Contributor:
UX question: the condition above captures flows that were completed after the requested start time.
In the case of start_time, would it make sense to instead capture the flows that were already started at start_time?

@dreamtalen (Contributor, Author) replied:
Sounds good to me, changed it to check flowStartSeconds instead.
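A minimal sketch of the adjusted filter, reusing the column and variable names from the quoted snippet; the base query string and the end_time clause (matching the -e/--end_time option documented below) are illustrative:

sql_query = "SELECT * FROM flows"  # illustrative base query, not the job's exact SQL
start_time = "2022-05-01 00:00:00"
end_time = None

# Select user trusted denied flows when unprotected equals False
sql_query += " WHERE trusted == 1"
if start_time:
    # Filter on flowStartSeconds so the recommendation covers flows that started
    # within the requested window, not just flows that happened to end after it.
    sql_query += " AND flowStartSeconds >= '{}'".format(start_time)
if end_time:
    sql_query += " AND flowEndSeconds <= '{}'".format(end_time)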

)
else:
print("Warning: egress tuple {} has wrong format".format(egress))
return ""
Contributor:
should this be considered an error? If so, should we fail the job instead of returning an empty string?

@dreamtalen (Contributor, Author) replied:
Makes sense to me; I marked this as a fatal error and now stop the Spark job immediately.
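A minimal sketch of the fail-fast behaviour, assuming the egress value is a (peer, port) tuple; the function name and rule format are illustrative, not the PR's exact helper:

import logging
import sys

logger = logging.getLogger("policy_recommendation")

def generate_egress_rule(egress):
    if isinstance(egress, tuple) and len(egress) == 2:
        peer, port = egress
        return "{} -> {}".format(peer, port)
    # A malformed tuple indicates a bug in the upstream flow processing, so stop
    # the Spark job immediately instead of silently returning an empty rule.
    logger.error("Error: egress tuple {} has wrong format".format(egress))
    sys.exit(1)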

def recommend_antrea_policies(flows_df, option=1, deny_rules=True, to_services=True):
    ingress_rdd = flows_df.filter(flows_df.flowType != "pod_to_external")\
        .rdd.map(map_flow_to_ingress)\
        .reduceByKey(lambda a, b: (a[0]+PEER_DELIMITER+b[0], ""))
Contributor:
(Not to be addressed in this PR) You could consider using a NamedTuple for src and dest, so that instead of referring to item 0 and item 1 you can refer to them as "src" and "dest".

@dreamtalen (Contributor, Author) replied:
In PySpark, I think achieving a namedtuple-like data structure would require changing the current RDDs to the DataFrame type, which would involve a lot of changes in the computation code. Could we mark this as a TODO for now?

Contributor:
Sure, it can be a TODO; ignore it for this PR.

The namedtuple would actually just be some syntactic sugar: you access items in the tuple as if you were accessing attributes of an object.

I would not think this requires using a DataFrame, but if that's the case, it's surely not worth the effort.
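A minimal sketch of the syntactic sugar being discussed; the field names are illustrative and not the PR's actual record layout:

from collections import namedtuple

FlowPeer = namedtuple("FlowPeer", ["src", "dst"])

peer = FlowPeer(src="frontend-pod", dst="backend-svc")
# A namedtuple is still a plain tuple, so it can be used inside RDD map/reduceByKey
# lambdas, but fields read as peer.src / peer.dst instead of peer[0] / peer[1].
print(peer.src, peer.dst)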

.option("password", os.getenv("CH_PASSWORD")) \
.option("dbtable", table_name) \
.save()
return recommendation_id
Contributor:
@yanjunz97 do you think we need to update the ClickHouse monitor to periodically clean up recommendation results as well? Do you think we might need to define an expiration time for results, as it is perhaps not OK to bluntly delete recommendation results when memory exceeds the threshold?

In any case, I am OK with not supporting periodic collection of old recommendation results in Theia's first release.

Contributor (@yanjunz97) replied:
As we do not expect the recommendation results to occupy too much space, I think an expiration time might be more reasonable compared to cleanup by the monitor.

But I'm not sure what expiration time should be chosen; a recommended policy might remain useful for a long time. Maybe it is more reasonable to delete results only when users trigger a deletion task from the UI?

Contributor:
Thanks @yanjunz97, that's valuable feedback. Nothing we need to address here, but we will surely need a mechanism to handle the lifecycle of policy recommendation results.

@salv-orlando (Contributor) commented:
It might also make sense to use the Python logging library instead of printing to stdout.
It should be fairly easy to introduce logging in this job.

@dreamtalen (Contributor, Author) replied:
It might also make sense to use the Python logging library instead of printing to stdout. It should be fairly easy to introduce logging in this job.

Sure, I added code to use the Spark logger to replace the print statements.

Signed-off-by: Yongming Ding <dyongming@vmware.com>
@dreamtalen force-pushed the policy-reco branch 2 times, most recently from ff8ef3e to 98ef479 on June 1, 2022 00:21.
]

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("info")
Contributor:
Should we make the log level configurable? It may help for live debugging.

Contributor:
Let's see what @dreamtalen reckons. From what I gather, this log will only emit what we are logging in this job, and, obviously, we are not logging anything at debug level.

@dreamtalen (Contributor, Author) replied:
Thanks Salvatore for helping me answer this question.
Yes, I hard-coded the log level to info because I only added logs at the info, warning, and error levels. I also tried changing it to "debug" and saw lots of debug logs automatically generated by Spark.
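A hypothetical sketch of making the level configurable, as suggested above; the -l/--log_level flag is an assumption for illustration and is not part of this PR:

import getopt
import sys
from pyspark.sql import SparkSession

log_level = "INFO"  # default, matching the current hard-coded behaviour
opts, _ = getopt.getopt(sys.argv[1:], "l:", ["log_level="])
for opt, arg in opts:
    if opt in ("-l", "--log_level"):
        log_level = arg.upper()  # hypothetical flag, e.g. --log_level=debug

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel(log_level)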

Comment on lines +752 to +761
-s, --start_time=None: The start time of the flow records considered for the policy recommendation.
Format is YYYY-MM-DD hh:mm:ss in UTC timezone. Default value is None, which means no limit of the start time of flow records.
-e, --end_time=None: The end time of the flow records considered for the policy recommendation.
Format is YYYY-MM-DD hh:mm:ss in UTC timezone. Default value is None, which means no limit of the end time of flow records.
Contributor:
Since we may not have suitable flow records in the DB, maybe we need a warning message to indicate this case?
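A minimal sketch of such a warning, assuming the flows_df DataFrame and a logger as used elsewhere in the job; the message text is illustrative:

def warn_if_no_flows(flows_df, logger):
    # Warn when no flow records match the requested start/end time window,
    # since the job would otherwise silently recommend nothing.
    if flows_df.rdd.isEmpty():
        logger.warning("No flow records found in the database for the requested "
                       "time range; no policies will be recommended.")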

@salv-orlando (Contributor) left a comment:
Code looks pretty good to me.
There are a few pending questions from Ziyou; waiting on those before approving.

@dreamtalen (Contributor, Author) commented on Jun 1, 2022:

It might also make sense to use the Python logging library instead of printing to stdout. It should be fairly easy to introduce logging in this job.

Sure, I added code to use the Spark logger to replace the print statements.

Update regarding logs: I found that the Spark logger only works on the driver; mapped functions running on the executors hit an error: "SparkContext can only be created/accessed/used on the driver." The Python logging library doesn't work either (no logs are emitted inside mapped functions). I'm trying to find another approach for logging.
Ref: https://stackoverflow.com/questions/36022988/in-pyspark-how-can-i-log-to-log4j-from-inside-a-transformation
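A minimal sketch of the driver-side logging that does work, assuming Spark's bundled log4j; the logger name is illustrative. The limitation described above is exactly that this logger, reached through the SparkContext, cannot be used inside functions shipped to the executors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Obtain Spark's log4j logger through the Py4J gateway; this only works in the
# driver process, where the SparkContext lives.
log4j = spark.sparkContext._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("PolicyRecommendationJob")
logger.info("Starting policy recommendation job")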

@dreamtalen force-pushed the policy-reco branch 2 times, most recently from 63fbf9e to 4b55b90 on June 2, 2022 00:15.
svc_acnp_list = svc_acnp_rdd.collect()
if deny_rules:
    if option == 1:
        # Recommend deny ANPs for the applied to groups of allow policies
Contributor:
Nit: appliedTo

@dreamtalen (Contributor, Author) replied:
Thanks Jianjun, addressed.

logger.error("Error: option {} is not valid".format(option))
return []
if option == 3:
# Recommend k8s native network policies for unprotected flows
Contributor:
Nit: NetworkPolicies

Contributor:
We capitalize the first letter of K8s resource and CRD kinds.

@dreamtalen (Contributor, Author) replied:
Addressed.

@salv-orlando (Contributor) left a comment:
I think the code LGTM. I hope the issues with logging have been sorted out.

    return flow_df

def write_recommendation_result(spark, result, recommendation_type, db_jdbc_address, table_name, id):
    if not id:
Contributor:
Nit (perhaps to be addressed in a future PR): id is a Python built-in. In this case it will be interpreted correctly, but using it is a risk from a maintainability perspective (e.g. if we rename the parameter to something else, no error is thrown when running the code, but "if not id" will then always be false!).

@dreamtalen (Contributor, Author) replied:
Thanks Salvatore, that's a fair concern. I renamed this parameter to recommendation_id_input instead.
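A sketch of the renamed signature; the other parameters follow the quoted snippet, and the uuid fallback for a missing id is an assumption used for illustration:

import uuid

def write_recommendation_result(spark, result, recommendation_type,
                                db_jdbc_address, table_name, recommendation_id_input):
    # The parameter no longer shadows the built-in id(), so a later rename cannot
    # silently turn "if not id" into a test of the built-in function.
    if not recommendation_id_input:
        recommendation_id = str(uuid.uuid4())  # assumed fallback when no id is supplied
    else:
        recommendation_id = recommendation_id_input
    return recommendation_id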

Signed-off-by: Yongming Ding <dyongming@vmware.com>
@dreamtalen dreamtalen merged commit 75f6968 into antrea-io:main Jun 3, 2022
Labels: none yet
Projects: none yet
8 participants