This repository has been archived by the owner on Dec 5, 2019. It is now read-only.

Backfill SparkJobRun.log_uri for old runs #520

Closed
robhudson opened this issue May 31, 2017 · 4 comments

Comments

@robhudson
Member

In #477 we will start recording log_uri for each Spark job run. Runs created before that lands, however, have no relation from the run to the log. The EMR cluster details API can provide the LogUri we need to populate them, but that means one API call per run, so we want to be mindful of API limits. The idea would be to create a Celery task that backfills in batches of x runs, with a new batch scheduled every y minutes.
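A minimal sketch of the batching idea, not the ATMO implementation. The names `plan_backfill` and `fetch_log_uri` are hypothetical, and `emr_client` is assumed to be a boto3 EMR client, whose `describe_cluster` response carries the `LogUri` under the `Cluster` key.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def plan_backfill(
    run_ids: Sequence[int], batch_size: int, interval_minutes: int
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (countdown_seconds, batch) pairs.

    Batch n is delayed by n * interval_minutes, so the per-run
    API calls are spread out instead of being issued all at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for index, start in enumerate(range(0, len(run_ids), batch_size)):
        countdown = index * interval_minutes * 60
        yield countdown, list(run_ids[start:start + batch_size])


def fetch_log_uri(emr_client, cluster_id: str) -> Optional[str]:
    """Look up the LogUri for one cluster (one API call per run).

    Assumes emr_client behaves like a boto3 EMR client.
    """
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response["Cluster"].get("LogUri")
```

With Celery, each pair could be handed to a (hypothetical) backfill task via `backfill_batch.apply_async(args=[batch], countdown=countdown)`, which keeps x and y as plain tunable numbers.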

@jezdez
Contributor

jezdez commented May 31, 2017

So I'm not sure if #477 is enough, since what we do is create logfiles as part of the job processing, in batch.sh: https://github.com/mozilla/emr-bootstrap-spark/blob/b3c7412b2f6b61c02b27125d6cad5935c16985ad/ansible/files/steps/batch.sh#L51 (and following lines)

Here's where the logs are uploaded to S3: https://github.com/mozilla/emr-bootstrap-spark/blob/b3c7412b2f6b61c02b27125d6cad5935c16985ad/ansible/files/steps/batch.sh#L112

@jezdez
Contributor

jezdez commented May 31, 2017

As I understand it, log_uri points to the log of the cluster bootstrapping, which is not the same as the log files of the actual job.

@robhudson
Member Author

Thanks for pointing out that the LogUri we pass isn't the job logs we display in ATMO.

Would it be worth it to try to match up the log files in the S3 bucket with the historical job runs? Maybe listing all the log files sorted by time and attaching them to the runs sorted by time would be close enough?
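A rough sketch of what that time-ordered matching could look like. Everything here is an assumption for illustration: the function name, the `(id, timestamp)` tuple shapes, and the `max_gap` cutoff are hypothetical, and nothing is known about how the real S3 keys or run records are laid out.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def match_logs_to_runs(
    runs: List[Tuple[int, datetime]],
    logs: List[Tuple[str, datetime]],
    max_gap: timedelta = timedelta(hours=6),
) -> Dict[int, str]:
    """Pair each run with the next unused log written after it started.

    runs: (run_id, start time); logs: (S3 key, upload time).
    A run stays unmatched if no log follows it within max_gap.
    """
    runs = sorted(runs, key=lambda item: item[1])
    logs = sorted(logs, key=lambda item: item[1])
    matches: Dict[int, str] = {}
    position = 0
    for run_id, started in runs:
        # Skip logs uploaded before this run even started.
        while position < len(logs) and logs[position][1] < started:
            position += 1
        if position < len(logs) and logs[position][1] - started <= max_gap:
            matches[run_id] = logs[position][0]
            position += 1  # each log file is used at most once
    return matches
```

This is only "close enough" matching: overlapping runs or missing log files would produce wrong or absent pairs, so any result would need spot-checking before being written back.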

Or just drop it altogether?

@rafrombrc
Member

We've got another approach for handling #477 (using cluster id to generate the log URLs) but we won't be able to do the backfill.
