Backfill SparkJobRun.log_uri for old runs #520
In #477 we will start recording log_uri for each spark job run. The runs prior to this landing, however, do not have a relation from the run to the log. The EMR cluster details API can provide us with the LogUri that we can use to populate these, but it would be an API call per run, so we want to be mindful of API limits. The idea would be to create a celery task that backfills in batches of x and spreads them out in celery tasks every y minutes.
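A minimal sketch of that backfill task, assuming boto3, celery, and Django ORM names (SparkJobRun, jobflow_id, log_uri) plus placeholder values for x and y; none of these names or values are confirmed by the issue:

```python
# Sketch of the proposed backfill: batches of x runs, spaced y minutes apart.
# Model/field names and the import path are assumptions for illustration.
import boto3
from celery import shared_task

emr = boto3.client("emr")

BATCH_SIZE = 50          # the issue's "x": runs handled per task invocation
INTERVAL_MINUTES = 10    # the issue's "y": spacing between batches


@shared_task
def backfill_log_uris():
    from atmo.jobs.models import SparkJobRun  # hypothetical import path

    runs = SparkJobRun.objects.filter(log_uri="")[:BATCH_SIZE]
    for run in runs:
        # One DescribeCluster call per run is why this needs batching at all.
        cluster = emr.describe_cluster(ClusterId=run.jobflow_id)["Cluster"]
        # Runs whose cluster has no LogUri stay empty and would be retried;
        # a real implementation would mark them with a sentinel instead.
        run.log_uri = cluster.get("LogUri", "")
        run.save(update_fields=["log_uri"])

    if SparkJobRun.objects.filter(log_uri="").exists():
        # Spread the remaining work out over time to respect EMR API limits.
        backfill_log_uris.apply_async(countdown=INTERVAL_MINUTES * 60)
```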
Comments

So I'm not sure if #477 is enough, since what we actually do is create log files as part of the job processing, in batch.sh: https://github.com/mozilla/emr-bootstrap-spark/blob/b3c7412b2f6b61c02b27125d6cad5935c16985ad/ansible/files/steps/batch.sh#L51 (and the following lines). Here's where the logs are uploaded to S3: https://github.com/mozilla/emr-bootstrap-spark/blob/b3c7412b2f6b61c02b27125d6cad5935c16985ad/ansible/files/steps/batch.sh#L112
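For reference, that upload step amounts to copying the locally written job log to the job's S3 prefix. A Python equivalent of the shell step, with a hypothetical bucket and key layout not taken from the actual script:

```python
# Minimal sketch of what the upload step in batch.sh amounts to; the bucket
# name and key layout here are assumptions, not read from the script.
import boto3

s3 = boto3.client("s3")

def upload_job_log(local_path, bucket, job_identifier, timestamp):
    # batch.sh writes the driver output to a local log file while the job
    # runs, then copies it to S3 so ATMO can link to it afterwards.
    key = f"logs/{job_identifier}/{timestamp}.log"
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```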
The log_uri thing, as I understand it, is the log of the cluster bootstrapping, which is not the same as the log files of the actual job.
Thanks for pointing out that the LogUri we pass isn't the job logs we display in ATMO. Would it be worth trying to match up the log files in the S3 bucket with the historical job runs? Maybe listing all the log files sorted by time and attaching them to the runs sorted by time would be close enough? Or should we just drop it altogether?
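A rough sketch of that time-based matching idea, assuming boto3 and hypothetical names for the bucket, prefix, and run attributes; whether positional pairing is actually "close enough" would need checking against real data:

```python
# Sketch of the pairing idea from the comment above; bucket, prefix, and
# the runs' created_at attribute are assumptions for illustration.
import boto3

s3 = boto3.client("s3")

def match_logs_to_runs(bucket, prefix, runs):
    """Pair S3 log objects with historical runs, both sorted by time."""
    objects = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects.extend(page.get("Contents", []))

    objects.sort(key=lambda obj: obj["LastModified"])
    runs = sorted(runs, key=lambda run: run.created_at)

    # Naive positional pairing: only plausible if every run produced exactly
    # one log file and nothing was deleted or uploaded out of order.
    return [
        (run, f"s3://{bucket}/{obj['Key']}")
        for run, obj in zip(runs, objects)
    ]
```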
We've got another approach for handling #477 (using the cluster id to generate the log URLs), but we won't be able to do the backfill.
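Since EMR writes cluster logs under the configured LogUri prefix followed by the cluster id, generating the URL for new runs needs no API call at all, provided the log bucket is fixed and known. A sketch with a hypothetical bucket name:

```python
# Sketch of "use the cluster id to generate the log URL". Assumes a single,
# known log bucket; the bucket name below is made up for illustration.
LOG_BUCKET = "telemetry-analysis-logs"

def log_uri_for(jobflow_id):
    # EMR places a cluster's logs under <LogUri>/<cluster id>/.
    return f"s3://{LOG_BUCKET}/{jobflow_id}/"
```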