Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[str] Add LiteralString overload for __getitem__ #12714

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

pradeep90
Copy link

In PEP 675, @gbleaney and I had specified a list of LiteralString-preserving overloads for str. However, we didn't specify an overload for __getitem__ and didn't give any rationale either. IIRC this was an edge case we didn't want to take a strong decision on unless users wanted it.

@carljm brought this up yesterday, so I think it's worth discussing.

Pro: my_literal_string[i] or my_literal_string[i:j] should technically be compatible with LiteralString, since it is a substring of a literal-derived string.

Con: The main downside is that an attacker might control the indexes and try to access a specific substring from a literal string in the code. For example, they might narrow down the string to rm foo or SELECT *.

It's true that join and other methods could also construct dangerous strings from LiteralStrings, and we even call that out as an accepted tradeoff in the PEP:

  1. Trivial functions could be constructed to convert a str to a LiteralString:
     def make_literal(s: str) -> LiteralString:
         letters: Dict[str, LiteralString] = {
             "A": "A",
             "B": "B",
             ...
         }
         output: List[LiteralString] = [letters[c] for c in s]
         return "".join(output)

We could mitigate the above using linting, code review, etc., but ultimately a clever, malicious developer attempting to circumvent the protections offered by LiteralString will always succeed. The important thing to remember is that LiteralString is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).

Without LiteralString, the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With LiteralString, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.

-- PEP 675 - Appendix B: Limitations

__getitem__, however, seems a bit different, because it (and split, zfill, etc.) accept an index or width that could be used to construct a dangerous query or a humongous string.

So, we need to clarify the intent a little. What was the intent of these overloads? We wanted to forbid "arbitrary user-supplied strings" while allowing methods that preserved literal strings. We were not trying to prevent every possible exploit on the string. Since __getitem__ forbids arbitrary user-supplied strings and preserves literal strings, I think we should add an overload for it.

Thanks to @carljm for bringing this up!

cc @gbleaney @JelleZijlstra in case they have different opinions. Would also love to hear from other users.

In PEP 675, @gbleaney and I had specified a list of `LiteralString`-preserving [overloads](https://peps.python.org/pep-0675/#appendix-c-str-methods-that-preserve-literalstring) for `str`. However, we didn't specify an overload for `__getitem__` and didn't give any rationale either. IIRC this was an edge case we didn't want to take a strong decision on unless users wanted it.

@carljm brought this up yesterday, so I think it's worth discussing.

Pro: `my_literal_string[i]` or `my_literal_string[i:j]` should technically be compatible with `LiteralString`, since it is a substring of a literal-derived string.

Con: The main downside is that an attacker might control the indexes and try to access a specific substring from a literal string in the code. For example, they might narrow down the string to `rm foo` or `SELECT *`.

It's true that `join` and other methods could also construct dangerous strings from `LiteralString`s, and we even call that out as an accepted tradeoff in the PEP:

> 4. Trivial functions could be constructed to convert a str to a LiteralString:
>
>     def make_literal(s: str) -> LiteralString:
>         letters: Dict[str, LiteralString] = {
>             "A": "A",
>             "B": "B",
>             ...
>         }
>         output: List[LiteralString] = [letters[c] for c in s]
>         return "".join(output)
>
> We could mitigate the above using linting, code review, etc., but ultimately a clever, malicious developer attempting to circumvent the protections offered by LiteralString will always succeed. The important thing to remember is that LiteralString is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).
>
> Without LiteralString, the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With LiteralString, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.
>
> -- [PEP 675 - Appendix B: Limitations](https://peps.python.org/pep-0675/#appendix-b-limitations)

`__getitem__`, however, seems a bit different, because it (and `split`, `zfill`, etc.) accept an index or width that could be used to construct a dangerous query or a humongous string. So, we need to clarify the intent a little.

What was the intent of these overloads? We wanted to forbid "arbitrary user-supplied strings" while allowing methods that preserved literal strings. We were not trying to prevent every possible exploit on the string. Since `__getitem__` forbids arbitrary user-supplied strings and preserves literal strings, I think we should add an overload for it.

Thanks to @carljm for bringing this up!

cc @gbleaney @JelleZijlstra in case they have different opinions. Would also love to hear from other users.
…load.

This should fix the error:

```
/opt/hostedtoolcache/Python/3.10.15/x64/bin/python -m mypy.stubtest --check-typeshed --custom-typeshed-dir . --allowlist stdlib/@tests/stubtest_allowlists/common.txt --allowlist stdlib/@tests/stubtest_allowlists/linux.txt --allowlist stdlib/@tests/stubtest_allowlists/py310.txt --allowlist stdlib/@tests/stubtest_allowlists/linux-py310.txt
error: not checking stubs due to mypy build errors:
stdlib/builtins.pyi:595: error: Overloaded function signature 2 will never be matched: signature 1's parameter type(s) are the same or broader  [misc]
```

I'm not sure why there is an error, because the overload seems valid to me. Other `LiteralString` overloads have the same `type: ignore[misc]`, so I'm just going to copy them.
Copy link
Contributor

github-actions bot commented Oct 1, 2024

Diff from mypy_primer, showing the effect of this PR on open source code:

streamlit (https://github.com/streamlit/streamlit)
- lib/tests/streamlit/elements/lib/column_config_utils_test.py:385:30: error: Invalid index type "str" for "Union[ColumnConfig, str, None]"; expected type "Union[SupportsIndex, slice]"  [index]
+ lib/tests/streamlit/elements/lib/column_config_utils_test.py:385:13: error: No overload variant of "__getitem__" of "str" matches argument type "str"  [call-overload]
+ lib/tests/streamlit/elements/lib/column_config_utils_test.py:385:13: note: Possible overload variants:
+ lib/tests/streamlit/elements/lib/column_config_utils_test.py:385:13: note:     def __getitem__(self, Union[SupportsIndex, slice], /) -> str

prefect (https://github.com/PrefectHQ/prefect)
- src/prefect/cli/cloud/__init__.py:325: error: Invalid index type "str" for "dict[UUID, list[Workspace]]"; expected type "UUID"  [index]
- src/prefect/cli/cloud/__init__.py:325: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
+ src/prefect/cli/cloud/__init__.py:325: error: No overload variant of "__getitem__" of "str" matches argument type "str"  [call-overload]
+ src/prefect/cli/cloud/__init__.py:325: note: Possible overload variants:
+ src/prefect/cli/cloud/__init__.py:325: note:     def __getitem__(self, SupportsIndex | slice, /) -> str

dd-trace-py (https://github.com/DataDog/dd-trace-py)
+ ddtrace/settings/config.py:840: error: Unused "type: ignore" comment  [unused-ignore]
+ ddtrace/settings/config.py:840: error: No overload variant of "__getitem__" of "str" matches argument type "str"  [call-overload]
+ ddtrace/settings/config.py:840: note: Error code "call-overload" not covered by "type: ignore" comment
+ ddtrace/settings/config.py:840: note: Possible overload variants:
+ ddtrace/settings/config.py:840: note:     def __getitem__(self, SupportsIndex | slice, /) -> str

Copy link
Member

@AlexWaygood AlexWaygood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this makes sense. I can't think of any way in which this would be any less safe from a security perspective than the existing LiteralString overload for str.__mul__, which also allows arbitrary integers and could also create a potentially huge string.

@AlexWaygood
Copy link
Member

Diff from mypy_primer, showing the effect of this PR on open source code:

I think this can just be ignored, because mypy still doesn't support LiteralString, and patches their copy of builtins.pyi so that all LiteralString overloads are removed from it.

@carljm
Copy link
Member

carljm commented Oct 1, 2024

This looks good to me, thanks @pradeep90 !

I don't understand the mypy-primer errors, but I'll take @AlexWaygood 's word for it that they aren't a problem.

@AlexWaygood
Copy link
Member

I don't understand the mypy-primer errors, but I'll take @AlexWaygood 's word for it that they aren't a problem.

It's a thing mypy does so they can halfway-but-not-really support LiteralString -- they patch typing.pyi so that LiteralString is just an alias to str. But that in turn means that they think all the LiteralString overloads in builtins.pyi overlap, causing spurious errors like those you see in mypy_primer here, which is why they remove all use of LiteralString from their patched version of builtins.pyi. TL;DR we don't need to worry about these primer errors: mypy users won't see them.

@JelleZijlstra
Copy link
Member

The mypy-primer errors seem to be because mypy issues a different error code for errors involving __getitem__ if __getitem__ is overloaded (e.g., https://github.com/DataDog/dd-trace-py/blob/7f6f3dbe7c2e00fe84cdf944e1121304306fb649/ddtrace/settings/config.py#L840).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants