This is a notebook to process and visualize Spark event log data. You can install dependencies with `pipenv install` and launch the notebook with `pipenv run jupyter notebook`. Open `metrics.ipynb` and edit the first notebook cell so that `metrics_file` points to the path of a Spark event log.
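
For example, that first cell might contain something like the following; the path shown is just a placeholder for your own event log:

```python
# First notebook cell: point metrics_file at a Spark event log.
# The path below is a placeholder; substitute your own event log path.
metrics_file = "/path/to/spark-events/application_161108976543_0123"
```
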
As above, install dependencies with `pipenv install`. You can then run the preprocessor over one or several log files with `pipenv run ./preprocess.py`. This script has online help, but here's something to get you started:

```
pipenv run python preprocess.py --master 'local[*]' --per-app-db --fail-fast application_161108976543_0123 --db metrics.db --outdir outputs
```
where `application_161108976543_0123` is a Spark event log file. You can specify one log file or several. After running this command, `outputs` will contain a SQLite database for each event log, and `metrics.db` will contain summary information for every event log.
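
If you want to inspect these results outside the notebook, here is a minimal sketch using Python's built-in `sqlite3` module. The per-application database filename below is an assumption about how the preprocessor names its outputs, so check the `outputs` directory for the actual filenames:

```python
# Sketch: list the tables in one of the generated SQLite databases.
# The filename is an assumption; look in the outputs directory for the
# actual per-application database names the preprocessor produced.
import sqlite3

conn = sqlite3.connect("outputs/application_161108976543_0123.db")
for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(name)
conn.close()
```
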
You can then start an API server for these databases with this command:

```
pipenv run datasette outputs/*.db --config sql_time_limit_ms:50000 --config max_returned_rows:1024768 --cors --metadata outputs/metadata.json --template-dir=templates
```
Navigate to `localhost:8001` to interact with individual tables. Note that the schema of these tables will change in the future.
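
Datasette also serves each table as JSON, so you can read these tables programmatically. Here is a minimal sketch, assuming the summary database is served as `metrics` (from `metrics.db` above) and using `jobs` as a placeholder table name; browse `localhost:8001` to see the tables that actually exist:

```python
# Sketch: fetch rows from a table through Datasette's JSON API.
# "metrics" comes from metrics.db above; "jobs" is a placeholder table name.
import json
import urllib.request

url = "http://localhost:8001/metrics/jobs.json?_shape=array"
with urllib.request.urlopen(url) as response:
    rows = json.loads(response.read().decode("utf-8"))

for row in rows[:5]:
    print(row)
```
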
If you have `dot` installed, you can run this example command against the API server you just started:

```
curl http://localhost:8001/${APP_NAME}/plan-graph.json\?_shape=array | ./querygraph.py | dot -Tpdf > graph.pdf
```

where `${APP_NAME}` is an application ID from the first step above.
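
The repository's `querygraph.py` performs the middle step of that pipeline. As a rough illustration of that kind of filter (not the script's actual implementation), here is a minimal sketch that reads JSON rows from stdin and emits a DOT graph; the `src` and `dst` field names are assumptions about the plan-graph rows:

```python
# Minimal sketch of a JSON-to-DOT filter in the spirit of querygraph.py.
# Assumes each row from plan-graph.json?_shape=array has "src" and "dst"
# fields naming plan nodes; the real column names may differ.
import json
import sys

rows = json.load(sys.stdin)

print("digraph plan {")
for row in rows:
    src = row["src"]
    dst = row["dst"]
    print(f'  "{src}" -> "{dst}";')
print("}")
```
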
Copyright (c) 2020-2021 NVIDIA Corporation