fix kernel generation for Spark Yarn // TOREE-97 #141
base: master
Conversation
There is a failure on the CI that doesn't look related to this patch. Unless writing the paths of those 2 env vars is so big that it's consuming all the storage! =)
It would also be useful to export
What is the difference with setting
Good question. So, it's clear that, if the Spark configuration files aren't in the default location, the user needs to be able to tell Spark where they are -- for this,
Maybe I am misunderstanding this. With vanilla Toree you have the option to start your own local Spark (e.g. local[...]), but in an enterprise environment where there is a large Spark cluster managed by Yarn you just need to connect to it, so all of this configuration is managed by the Spark and Hadoop configuration files directly. Also, in most enterprise deployments, the cluster is deployed from some distribution which includes many other components, and we don't want to, and don't need to, make Toree aware of them. Anyway, what is the scenario you are trying to accomplish with these changes?
", thus all these configuration being managed by spark and Hadoop configuration files directly." sure, but how do you tell Hadoop and Spark where to find the The scenario is very simple: in Bright Cluster Manager, users can have many Hadoop instances and many Spark instances. And they're able to connect any of those Spark instances with Yarn of any of those Hadoop instances. If there are many instances, there are many configurations files, and so, they cannot be in the default location, right? We could have a Jupyter/JupyterHub/Toree deployment (like we do with other tools) per Spark instance. But things are much simpler: we just need to add one Indeed, this PR is not really a requirement for us -- during our integration process, we already update the JSON files to contain those variables,so our user already benefit from the possibility of accessing a Toree kernel per Spark instance. We are just really trying to give back our 2 cents for the case there are Toree vanilla users trying to achieve the same without our product. |
help='''Specify where the hadoop config files can be found.'''
)
spark_conf_dir = Unicode(os.getenv(SPARK_CONF_DIR, '/usr/local/spark'), config=True,
help='''Specify where the spark config files can be found.'''
Should the default value actually be /usr/local/spark/conf ?
Sure, it has to be the place where spark-env.sh is found.
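To make that criterion concrete, here is a small sketch of how one might sanity-check a candidate conf dir. The check itself is my illustration, not part of the patch, and the file list is an assumption based on typical Spark conf directories:

```python
import os

def looks_like_spark_conf_dir(path):
    """Heuristic: a Spark conf dir normally contains spark-env.sh
    (or at least its template, or spark-defaults.conf)."""
    candidates = ("spark-env.sh", "spark-env.sh.template", "spark-defaults.conf")
    return any(os.path.isfile(os.path.join(path, name)) for name in candidates)
```

Under this heuristic, /usr/local/spark/conf would pass on a standard install, while /usr/local/spark itself typically would not -- which is the point of the comment above.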
@@ -57,6 +59,12 @@ class ToreeInstall(InstallKernelSpec):
spark_home = Unicode(os.getenv(SPARK_HOME, '/usr/local/spark'), config=True,
In general, do we actually want to use default values here? I am assuming we don't really have a standard default place to deploy Spark/Hadoop, so maybe it would be better to use the env variables if they are available and otherwise ignore them, or, in the case of required ones, throw an error?
If they're set to be empty, it's just as if they're not set -- so yes, they're probably better left empty.
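The suggestion above -- prefer the environment variable when present and otherwise leave the value empty rather than guessing an install path -- could look roughly like this. A sketch only, not the actual Toree code; the helper name is mine:

```python
import os

def conf_dir_default(var_name):
    """Return the env var's value when it is set; otherwise an empty
    string, which behaves the same as the variable not being set at all."""
    return os.getenv(var_name, "")

# Empty defaults: nothing is written unless the user's environment says so.
hadoop_conf_dir = conf_dir_default("HADOOP_CONF_DIR")
spark_conf_dir = conf_dir_default("SPARK_CONF_DIR")
```

This sidesteps the question of a "standard" deploy location entirely: users who export the variables get them propagated into kernel.json, and everyone else gets no surprising hard-coded path.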
@ribamar-santarosa Thanks for the explanation. I believe I understand the scenario now and have just minor comments on the changes.
It looks like the TOREE-97 issue (support for Spark Yarn) was closed without a definitive solution (or something went wrong along the way). Toree does support it, but it won't work unless the user manually adds, in their kernel.json definition, the env var HADOOP_CONF_DIR. Without that env var, Spark doesn't know what to do with the option --master=yarn (set in __TOREE_SPARK_OPTS__). It would be desirable to have it by default, and this patch provides that functionality.

Probably this is not the nicest way to solve the problem, because it just hard-codes more vars into the JSON file -- ideally there would be an interface to add or remove env vars from those files. However, HADOOP_CONF_DIR and SPARK_CONF_DIR look basic enough to be exported. Even for a Spark Standalone deployment, HADOOP_CONF_DIR won't hurt. So, here go our 2 cents to improve the situation a bit.

I cloned TOREE-97 into TOREE-438 to sign this issue.