fix kernel generation for Spark Yarn // TOREE-97 #141

Open
ribamar-santarosa wants to merge 1 commit into master

Conversation


@ribamar-santarosa commented Sep 12, 2017

It looks like TOREE-97 -- support for Spark on Yarn -- was closed without a definitive solution (or something went wrong along the way). Toree does support it, but it won't work unless the user manually adds the HADOOP_CONF_DIR env var to their kernel.json definition. Without that env var, Spark doesn't know what to do with the --master=yarn option (set in __TOREE_SPARK_OPTS__). It would be desirable to have it set by default, and this patch provides that.
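For reference, this is roughly what the manual workaround looks like today: a kernel.json with the env var added by hand. This is only a sketch -- all paths are illustrative, and the run.sh location depends on where the kernelspec was installed:

    {
      "display_name": "Apache Toree - Scala",
      "language": "scala",
      "argv": [
        "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
        "--profile",
        "{connection_file}"
      ],
      "env": {
        "SPARK_HOME": "/usr/local/spark",
        "__TOREE_SPARK_OPTS__": "--master=yarn",
        "HADOOP_CONF_DIR": "/etc/hadoop/conf"
      }
    }

With HADOOP_CONF_DIR exported like this, Spark can resolve --master=yarn against the right cluster.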

This is probably not the nicest way to solve the problem, because it just hard-codes more vars into the JSON file -- ideally there would be an interface to add or remove env vars from those files. However, HADOOP_CONF_DIR and SPARK_CONF_DIR seem basic enough to export, and even for a Spark Standalone deployment HADOOP_CONF_DIR won't hurt. So here go our 2 cents to improve the situation a bit.

I cloned TOREE-97 into TOREE-438 to track this fix.

@ribamar-santarosa
Author

There is a failure on the CI that doesn't look related to this patch:

failed to register layer: Error processing tar file(exit status 1): write /opt/conda/envs/python2/lib/python2.7/site-packages/Cython/Compiler/Code.so: no space left on device

Unless the paths of those two env vars are so long that they're consuming all the storage! =)

@lammic
Contributor

lammic commented Sep 12, 2017

It would also be useful to export JAVA_HOME, for the case where I don't want to use the default one but a specific release.
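For example, a hypothetical env block carrying the JVM choice alongside the Spark settings (the JDK path is illustrative):

    "env": {
      "JAVA_HOME": "/usr/lib/jvm/java-8-openjdk",
      "SPARK_HOME": "/usr/local/spark",
      "__TOREE_SPARK_OPTS__": "--master=yarn"
    }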

@lresende
Member

What is the difference between this and setting HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh?
More generally, why manage system-wide configuration in the kernelspec?

@ribamar-santarosa
Author

Good question. It's clear that if the Spark configuration files aren't in the default location, the user needs a way to tell Spark where they are -- that's what SPARK_CONF_DIR is for.

HADOOP_CONF_DIR is a bit trickier, because the standard assumption is that Spark on Yarn is tied to a single Hadoop instance -- in which case spark-env.sh suffices. But like anything in computing, somebody will try to expand a 1-1 relationship into 1-N: we can run module load another_instance_of_hadoop, which dynamically overwrites HADOOP_CONF_DIR, and then install a Toree kernel for that particular Hadoop-Spark pair.

@lresende
Member

Maybe I am misunderstanding this. While with vanilla Toree you have the option to start your own local Spark (e.g. local[...]), in an enterprise environment with a large Spark cluster managed by Yarn you just need to connect to it, so all this configuration is managed directly by the Spark and Hadoop configuration files. Also, most enterprise deployments are based on some distribution that bundles many other components, and we don't want to, and don't need to, make Toree aware of them.

Anyway, what is the scenario you are trying to accomplish with these changes?

@ribamar-santarosa
Author

", thus all these configuration being managed by spark and Hadoop configuration files directly." sure, but how do you tell Hadoop and Spark where to find the env.sh file, if they're not in the default location? With SPARK_CONF_DIR and HADOOP_CONF_DIR.

The scenario is very simple: in Bright Cluster Manager, users can have many Hadoop instances and many Spark instances, and they can connect any of those Spark instances to the Yarn of any of those Hadoop instances. If there are many instances, there are many configuration files, so they cannot all be in the default location, right? We could have one Jupyter/JupyterHub/Toree deployment per Spark instance (like we do with other tools), but things are much simpler than that: we just need one kernel.json per Spark instance, each with a different SPARK_CONF_DIR and HADOOP_CONF_DIR. If those variables weren't there, all of those kernel.json files would be identical -- and then how would you tell which one to use?
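To illustrate with hypothetical paths, the env blocks of two such kernel.json files would differ only in those two variables:

    "env": {
      "SPARK_CONF_DIR": "/cm/shared/apps/spark/instance-1/conf",
      "HADOOP_CONF_DIR": "/cm/shared/apps/hadoop/instance-1/etc/hadoop"
    }

    "env": {
      "SPARK_CONF_DIR": "/cm/shared/apps/spark/instance-2/conf",
      "HADOOP_CONF_DIR": "/cm/shared/apps/hadoop/instance-2/etc/hadoop"
    }

Drop those two variables and the files become indistinguishable.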

Indeed, this PR is not really a requirement for us -- during our integration process we already update the JSON files to contain those variables, so our users already benefit from having a Toree kernel per Spark instance. We are just trying to give back our 2 cents for the vanilla Toree users trying to achieve the same without our product.

    help='''Specify where the hadoop config files can be found.'''
)
spark_conf_dir = Unicode(os.getenv(SPARK_CONF_DIR, '/usr/local/spark'), config=True,
    help='''Specify where the spark config files can be found.'''
)
Member


Should the default value actually be /usr/local/spark/conf?

Author


Sure, it has to be the place where spark-env.sh is found.

@@ -57,6 +59,12 @@ class ToreeInstall(InstallKernelSpec):
spark_home = Unicode(os.getenv(SPARK_HOME, '/usr/local/spark'), config=True,
Member


In general, do we actually want default values here? I am assuming we don't really have a standard default place to deploy Spark/Hadoop, so maybe it would be better to use the env variables if they are available, otherwise ignore them -- or, for required ones, throw an error?

Author


If they're set to empty, it's just as if they're not set at all -- so it's true that they're probably better left empty.

@lresende
Member

@ribamar-santarosa Thanks for the explanation, I believe I understand the scenario now and have just minor comments on the changes.
