
Python dependency inference: add/emit debugging information #17039

Closed
lilatomic opened this issue Sep 28, 2022 · 4 comments · Fixed by #17057
Labels
backend: Python, enhancement

Comments

@lilatomic
Contributor

Is your feature request related to a problem? Please describe.
There currently isn't any information emitted about what dependencies were inferred. If dependency inference fails, developers don't have much to pinpoint the error.

Two concrete use cases:

  • I ran into an issue where dependent files weren't pulled into a test PEX. This was due to some shenanigans I had with Pants source roots, where the imports were resolved from the repository root, not from the closest parent source root.
  • I ran into an issue where guarded imports weren't pulled into a test PEX. I had the incorrect module path, but the PEX built fine because they were weak imports.
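The second case involved a guarded import: one wrapped in try/except so the program tolerates its absence. A minimal sketch (module name hypothetical):

```python
# A guarded ("weak") import: if the module is missing, the program still
# runs, so a misspelled module path produces no error at build or run time.
try:
    import definitely_not_a_real_module  # hypothetical, misspelled import
except ImportError:
    definitely_not_a_real_module = None  # fall back gracefully

print(definitely_not_a_real_module)  # → None when the module is absent
```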

Describe the solution you'd like
A good first pass would be having the ability to output,

  • for every file, the imports found and whether they were resolved.
  • the map of 3rd party dependencies to modules
  • the map of 1st party targets to modules

This would help developers confirm that the imports were found, and understand whether the targets of those imports were found or ignored. For the second use case, I would expect output like the following, telling me that it was deliberate and acceptable that the unresolved import did not raise an error:

```json
[
  {
    "path": "//folder/file.py",
    "imports": [
      {
        "import": "a.b.c",
        "weak": true,
        "resolved": false
      }
    ]
  }
]
```

Describe alternatives you've considered
The current output only marks checkpoints. With `-ltrace`, this is what is printed (many messages elided):

```
11:31:51.66 [DEBUG] Completed: Find all targets in the project
11:31:51.66 [DEBUG] Completed: Find all Python targets in project
11:31:51.66 [DEBUG] Completed: Creating map of third party targets to Python modules
11:31:51.67 [DEBUG] Completed: Creating map of first party Python targets to Python modules
11:31:51.67 [TRACE] Completed: Inferring Python dependencies by analyzing source
11:31:51.67 [TRACE] Completed: Inferring Python dependencies by analyzing source
```

Additional context
Related: #13283. This issue requests information about dependency inference itself, while that one requests graphing the normal dependencies. The discussion there mentions graphing at different scopes; this issue concerns a scope below what could ordinarily be graphed by that.

@lilatomic
Contributor Author

I think there's a lot of information that will be output, so something like peek's `--peek-output-file` option might be appropriate.

@stuhood
Member

stuhood commented Sep 28, 2022

There is some (informal) prior art here in the JVM backend (and the Go backend as well, actually): those goals dump the sources of third-party dependencies as JSON, and the exact extracted symbols per file, respectively.

If symbol extraction were standardized across languages behind a @union, a goal like this could be done in a language-agnostic manner. But failing that, adding a debug goal like this would also be an option. cc @tdyas

@Eric-Arellano
Contributor

I think that a debug_goals backend is a good idea for Python. This ticket would be awesome! Thanks for the suggestion.

@stuhood
Member

stuhood commented Sep 28, 2022

> I think there's a lot of information that will be output, so something like peek's `--peek-output-file` might be appropriate

Oh, hm! It just occurred to me that another connection to your peek idea is #16967: essentially, you could think of the extracted imports / consumed-symbols of a file as effectively computed "file metadata" about that file. If we had additional generic computed per-file metadata like this, then peek might be a natural place to (optionally) render it...

@tdyas added the backend: Python label on Oct 2, 2022
stuhood pushed a commit that referenced this issue Nov 17, 2022
See #17039.

Given a testbed of 
<details>
  <summary>input</summary>

```python
# Copyright 2022 Pants project contributors (see CONTRIBUTORS.md).
# Licensed under the Apache License, Version 2.0 (see LICENSE).

import json  # unownable, root level
import os.path  # unownable, not root level

import watchdog  # dependency not included
import yaml  # dependency included
import yamlpath  # owned by other resolve

try:
    import weakimport  # weakimport missing
except ImportError:
    ...

open("src/python/configs/prod.json")  # asset
open("testprojects/pants-plugins/src/python/test_pants_plugin/__init__.py")
```
</details>

we get 

<details>
  <summary>output</summary>

```
{
  "src/python/pants/backend/python/dependency_inference/t.py": {
    "imports": [
      {
        "name": "weakimport",
        "reference": {
          "lineno": 12,
          "weak": true
        },
        "resolved": {
          "status": "ImportOwnerStatus.weak_ignore",
          "address": []
        },
        "possible_resolve": null
      },
      {
        "name": "json",
        "reference": {
          "lineno": 4,
          "weak": false
        },
        "resolved": {
          "status": "ImportOwnerStatus.unownable",
          "address": []
        },
        "possible_resolve": null
      },
      {
        "name": "os.path",
        "reference": {
          "lineno": 5,
          "weak": false
        },
        "resolved": {
          "status": "ImportOwnerStatus.unownable",
          "address": []
        },
        "possible_resolve": null
      },
      {
        "name": "watchdog",
        "reference": {
          "lineno": 7,
          "weak": false
        },
        "resolved": {
          "status": "ImportOwnerStatus.unowned",
          "address": []
        },
        "possible_resolve": null
      },
      {
        "name": "yaml",
        "reference": {
          "lineno": 8,
          "weak": false
        },
        "resolved": {
          "status": "ImportOwnerStatus.unambiguous",
          "address": [
            "3rdparty/python#PyYAML",
            "3rdparty/python#types-PyYAML"
          ]
        },
        "possible_resolve": null
      },
      {
        "name": "yamlpath",
        "reference": {
          "lineno": 9,
          "weak": false
        },
        "resolved": {
          "status": "ImportOwnerStatus.unowned",
          "address": []
        },
        "possible_resolve": [
          [
            "src/python/pants/backend/helm/subsystems:yamlpath",
            "helm-post-renderer"
          ]
        ]
      }
    ],
    "assets": [
      {
        "name": "src/python/configs/prod.json",
        "reference": "src/python/configs/prod.json",
        "resolved": {
          "status": "ImportOwnerStatus.unowned",
          "address": []
        },
        "possible_resolve": null
      },
      {
        "name": "testprojects/pants-plugins/src/python/test_pants_plugin/__init__.py",
        "reference": "testprojects/pants-plugins/src/python/test_pants_plugin/__init__.py",
        "resolved": {
          "status": "ImportOwnerStatus.unambiguous",
          "address": [
            "testprojects/pants-plugins/src/python/test_pants_plugin/__init__.py:../../../../pants_plugins_directory"
          ]
        },
        "possible_resolve": null
      }
    ]
  }
}
```
</details>

This tells you, for each file and for each import, what dependencies Pants thought it could have and what it decided to do with them.
This uses almost all the same code as the main dependency inference code, except for the top-level orchestration; there are about 100 lines of semi-duplicate code.
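Assuming output in the JSON shape shown above, a small script could filter the dump down to just the problematic imports. A sketch (the sample data is abridged from the dump above):

```python
import json

# Abridged sample in the shape of the dump above.
dump = json.loads("""
{
  "src/python/pants/backend/python/dependency_inference/t.py": {
    "imports": [
      {"name": "watchdog",
       "resolved": {"status": "ImportOwnerStatus.unowned", "address": []},
       "possible_resolve": null},
      {"name": "yaml",
       "resolved": {"status": "ImportOwnerStatus.unambiguous",
                    "address": ["3rdparty/python#PyYAML"]},
       "possible_resolve": null}
    ]
  }
}
""")

# Collect (file, import) pairs whose owner could not be determined.
unowned = [
    (path, imp["name"])
    for path, analysis in dump.items()
    for imp in analysis["imports"]
    if imp["resolved"]["status"] == "ImportOwnerStatus.unowned"
]
print(unowned)
```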

There's also a more advanced mode that dumps information about each stage of the process. I think this might be useful for people digging through the dependency inference process, but not really for end users. We get it for free, though.

Fixes #17039.

---

this is fairly critical for performance, so here are benchmarks (with comparison-of-means t-test)

|   | main | this PR | difference | p-value |
| --- | --- | --- | --- | --- |
| `hyperfine --runs=10 './pants --no-pantsd dependencies --transitive ::'` | 21.839 s ±  0.326 s | 22.142 s ±  0.283 s | 1.38% | 0.0395 |
| `hyperfine --warmup=1 --runs=10 './pants dependencies --transitive ::'` | 1.798 s ±  0.074 s | 1.811 s ±  0.076 s | 0.72% | 0.7029 |
| `hyperfine --runs=10 './pants --no-pantsd dependencies ::'` | 21.547 s ±  0.640 s  | 21.863 s ±  1.072 s | 1.47% | 0.4339 |
| `hyperfine --warmup=1 --runs=10 './pants dependencies ::'` | 1.828 s ±  0.091 s | 1.844 s ±  0.105 s | 0.88% | 0.7200 |
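The p-values above can be sanity-checked from hyperfine's reported means and standard deviations. A pure-Python sketch of the Welch t statistic for the first row (assuming n = 10 runs per side):

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic for a comparison of means with unequal variances."""
    return (mean2 - mean1) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# First row of the table: main vs. this PR, --no-pantsd, transitive.
t = welch_t(21.839, 0.326, 10, 22.142, 0.283, 10)
print(round(t, 2))  # ≈ 2.22, consistent with a two-sided p near 0.04 at ~18 dof
```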

So it looks like this PR might impact performance by about 1%, although those p-values are mighty unconvincing. Let me know if we want to increase the number of runs and gather more statistics; I've run the stats a few times throughout and this looks about right, so I think we can proceed with the review under the assumption that there is currently a ~1% performance overhead. I'm open to suggestions on improving performance.