
stop abusing git for timing data storage #36

Closed
jrevels opened this issue Jul 24, 2017 · 8 comments


@jrevels (Member) commented Jul 24, 2017

Should've made this issue a long time ago.

Each Nanosoldier run generates a fair amount of timing data, which is currently stored in https://github.com/JuliaCI/BaseBenchmarkReports. As was discussed way back in the early days of Nanosoldier, this is a pretty gross abuse of git/GitHub.

We could instead just dump the data on a publicly accessible filesystem (and eventually, let the data be ingested by a more granularly queryable database).

While not directly tied to this issue, it'd also be nice to tackle the old lightweight/stable/portable serialization issue at the same time. It should be simple enough to write a JSON (de)serializer for the list of (benchmark key, BenchmarkTools.Trial) pairs you'd need to store.
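A minimal sketch of what such a (de)serializer could look like, in Python purely for illustration (the benchmark keys, field names, and sample values below are all invented; a real implementation would mirror the actual fields of a BenchmarkTools.Trial):

```python
import json

# Hypothetical payload: each benchmark key maps to its raw sample times
# (in nanoseconds) plus memory/allocation counts, loosely mirroring the
# fields of a BenchmarkTools.Trial. All values here are invented.
trials = {
    "sum_linear": {"times": [105.0, 102.5, 101.9], "memory": 0, "allocs": 0},
    "sum_cartesian": {"times": [210.3, 208.7, 209.1], "memory": 16, "allocs": 1},
}

def serialize(trials):
    # sort_keys gives a stable byte layout, which also helps
    # delta-compression if the files end up in version control anyway.
    return json.dumps(trials, sort_keys=True)

def deserialize(blob):
    return json.loads(blob)

roundtripped = deserialize(serialize(trials))
print(roundtripped == trials)  # True
```

The point is only that the format is flat and self-describing, so it stays lightweight, stable, and portable across Julia versions.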

@simonbyrne (Member):

What about sticking it in some cloud nosql store (DynamoDB, BigTable, etc.)?

@StefanKarpinski:

How about producing CSV and pushing it to S3?

@jrevels (Member, Author) commented Jul 28, 2017

> How about producing CSV and pushing it to S3?

The data's not well structured for CSV (AFAICT you'd have to have many, many smallish files), but yeah, any simple storage solution should work fine for this to start.

@vtjnash (Member) commented Mar 4, 2021

The JSON files compress well (e.g. JuliaCI/BenchmarkTools.jl#79), so while we have generated a lot of data, and may want to add a time-series database (TSDB) for other reasons, the current rate of growth isn't terrible:

$ du -sh NanosoldierReports/
13G     NanosoldierReports/
$ du -sh NanosoldierReports/.git
5.8G    NanosoldierReports/.git

$ du -sh NanosoldierReports/pkgeval/by_date/latest/
11M     NanosoldierReports/pkgeval/by_date/latest/

$ du -sh NanosoldierReports/benchmark/by_date/2021-02/17/
7.4M    NanosoldierReports/benchmark/by_date/2021-02/17/

What might make the most difference is doing each by_date run as an update to latest directly (instead of keeping each copy in a folder). That would make the most use of git's abilities to compare logs and delta-compress changes.

@KristofferC (Contributor):

I kind of doubt the value of saving the timing data for each trial and think it is enough to save the statistics. Sure, people said some years ago that someone might want to do some analysis on the data, but looking at the activity in these reports, I doubt that will happen, or that it would yield anything actionable for old runs.

@vtjnash (Member) commented Sep 7, 2021

The isdaily reports are generated from the data.tar.xz files.

@KristofferC (Contributor):

Yes, there is a ~500 MB JSON file in there that contains the timing for every sample of every benchmark. I don't think anything has been done with those trial timings other than computing their minimum. So I am saying to just store e.g. the minimum and reduce the file size by a couple of orders of magnitude.
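The proposed reduction can be illustrated with a small sketch (Python for illustration; the keys and numbers are invented): collapse each per-sample array down to the one statistic the reports actually compute.

```python
import json

# Hypothetical full record: every sample time for every benchmark,
# as stored in the large JSON files today. All values are invented.
full = {
    "sum_linear": {"times": [105.0, 102.5, 101.9], "memory": 0},
    "sum_cartesian": {"times": [210.3, 208.7, 209.1], "memory": 16},
}

def summarize(full):
    # Keep only the minimum time (the statistic the reports use)
    # alongside the scalar fields; the per-sample arrays are dropped.
    return {
        key: {"min_time": min(rec["times"]), "memory": rec["memory"]}
        for key, rec in full.items()
    }

summary = summarize(full)
# The summary serializes to far fewer bytes than the full sample data.
print(len(json.dumps(summary)) < len(json.dumps(full)))  # True
```

With real runs of thousands of samples per benchmark, dropping the arrays is where the couple-of-orders-of-magnitude saving would come from.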

@maleadt (Member) commented Jan 17, 2023

maleadt closed this as completed Jan 17, 2023.