Confirm collected pystats are reliable #511

Closed
mdboom opened this issue Dec 9, 2022 · 5 comments
mdboom (Contributor) commented Dec 9, 2022

Now that pystats collection is automated, we should confirm that we trust the results. We did notice that the new results are quite a bit different from the manually generated ones from a few months ago. The expectation was that the new ones wouldn't include the pyperformance/pyperf machinery itself, but the differences seem too large for that alone to explain.

mdboom self-assigned this Dec 9, 2022
mdboom (Contributor, Author) commented Dec 9, 2022

I think I've found the crux of the problem: when the process is in the "stats off" state (after calling sys._stats_off()), stats aren't dumped. When we run a benchmark, we surround the benchmark code with sys._stats_on() and sys._stats_off(), so nothing is dumped at the end of the process. The fact that we are getting any stats at all is a sign of leakage from non-benchmark code (which I'll treat as a separate issue that I haven't yet solved).

There are a couple of ways to solve this:

  • Modify pyperf to call sys._stats_dump(); sys._stats_off() after running benchmark code. This leaks the operation of calling sys._stats_dump() itself, but that's probably not a huge deal (see the sketch after this list).
  • Modify CPython to always dump whatever is left, even if stats are turned off. I think this is probably the better, less surprising solution.
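
A minimal sketch of the first option on the pyperf side, assuming a CPython build configured with --enable-pystats (where sys._stats_on(), sys._stats_off(), and sys._stats_dump() exist); the wrapper function here is illustrative, not pyperf's actual API:

```python
import sys

def run_benchmark_with_stats(bench_func, *args):
    """Illustrative wrapper: collect pystats only around the benchmark body."""
    # The sys._stats_* hooks only exist on --enable-pystats builds.
    if not hasattr(sys, "_stats_on"):
        return bench_func(*args)

    sys._stats_on()
    try:
        return bench_func(*args)
    finally:
        # Dumping before turning stats off ensures the counters collected
        # during the benchmark are written out; the dump call itself is
        # then the only non-benchmark work that leaks into the stats.
        sys._stats_dump()
        sys._stats_off()
```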

@markshannon Thoughts?

markshannon (Member) commented

Stats should be dumped, even if stats are off.

Because stats are on at startup, any non-benchmark process will dump stats, so that also needs to be fixed.

Maybe start with stats off by default, and turn them on during startup if an -Xstats option is enabled?

That way the stats should work for pyperformance as it is now, but we still have the option to gather stats for a whole program.
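
A rough sketch of the suggested default, assuming the interpreter starts with stats off and only turns them on when an -X option is present. The option name and the user-level check are assumptions for illustration; in CPython the real decision would be made in the interpreter's startup code, not in Python:

```python
import sys

def maybe_enable_stats():
    # Hypothetical: stats stay off unless the process was started with
    # something like `python -Xstats ...` (exact option name assumed).
    if "stats" in sys._xoptions and hasattr(sys, "_stats_on"):
        sys._stats_on()  # gather stats for the whole program run
```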

mdboom (Contributor, Author) commented Dec 9, 2022

See python/cpython#100144

sweeneyde commented

It seems the .json files are not comparable to each other because they have keys like "opcode[157].pair_count[145]", which change meaning as bytecodes change.

Couldn't _Py_GetSpecializationStats be modified to output _PyOpcode_OpName[i] instead of the integer i? I wonder to what degree that would clean up the various scripts. It would also make it possible to run the summary script with a different Python than the one whose stats were collected.
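
For comparison, that kind of index-to-name translation can also be done after the fact with the stdlib opcode module; here is a sketch assuming keys shaped like the example above (the regex helper is hypothetical, and it only works if the running interpreter's opcode numbering matches the one that produced the stats, which is exactly the fragility being pointed out):

```python
import opcode
import re

def translate_key(key):
    """Rewrite e.g. 'opcode[157].pair_count[145]' using names from opcode.opname."""
    return re.sub(
        r"\[(\d+)\]",
        lambda m: f"[{opcode.opname[int(m.group(1))]}]",
        key,
    )

# translate_key("opcode[157].pair_count[145]") replaces the numeric indices
# with the corresponding opcode names for the running interpreter.
```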

mdboom (Contributor, Author) commented Dec 12, 2022

The .json files are just shorter summaries of the raw output, which is designed to be as easy as possible to generate from C. (The C code doesn't have access to the opcode names; it would have to add a table for that, or import opcode.py, etc.)

I'm not sure there's a lot of value in making this data format well-defined and permanent. Comparing really only makes sense between adjacent revisions anyway. Personally, I see the requirement to use a matching version of the script as a feature -- we can change the semantics of how this works as we understand the problems better, without concern for backward compatibility.

All that said, we should probably agree on a policy about this and document it somewhere.

mdboom closed this as completed Dec 20, 2022