Preserve header casing #103

tomchristie · 2020-08-25T11:57:36Z

A proposal for preserving header casing information in both directions, while still exposing lower-case only to users by default, and keeping careful type separation.

Essentially a take on @Lukasa's comment here #31 (comment) and @njsmith's follow up...

"write a data structure that is basically a tuple, but allows you to access headers either case sensitively or case insensitively (where case insensitively is the default)"

Here's a really simple (maybe too simple) idea to throw into the mix: use tuples of (header, header_with_original_casing, value)

Here's how it looks to the end-user...

h11 now preserves header casing on sending headers.
.headers becomes a Sequence type, rather than a raw list. Iterating over the sequence continues to return (<lowercased-name>, <value>) pairs.
.headers.raw() is available for usages such as console or debug output that require original casing information, and returns a list of (<lowercased-name>, <raw-name>, <value>) three-tuples.
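
As a rough illustration of the interface described above, here is a minimal sketch of a Sequence-based headers container. It is not the code from this PR: the class body and the `make_headers` helper are assumptions made for the example, and only the `(<lowercased-name>, <value>)` iteration and the three-tuple `.raw()` shape are taken from the description.

```python
from collections.abc import Sequence


class Headers(Sequence):
    """Illustrative stand-in for the proposed type, not h11's implementation."""

    def __init__(self, full_items):
        # Each entry is (lowercased_name, raw_name, value).
        self._full_items = list(full_items)

    def __len__(self):
        return len(self._full_items)

    def __getitem__(self, index):
        # The default view hides the original casing.
        if isinstance(index, slice):
            return [(low, value) for low, _raw, value in self._full_items[index]]
        low, _raw, value = self._full_items[index]
        return (low, value)

    def __eq__(self, other):
        # Compare as a list of (lowercased_name, value) pairs.
        return list(self) == other

    def raw(self):
        # Three-tuples that keep the original casing alongside the lowercased name.
        return list(self._full_items)


def make_headers(pairs):
    # Hypothetical helper: build the container from (name, value) byte pairs.
    return Headers((name.lower(), name, value) for name, value in pairs)


headers = make_headers([(b"Host", b"example.org"), (b"Accept", b"*/*")])
for name, value in headers:
    print(name, value)    # lowercased names
print(headers.raw())      # original casing preserved in the middle element
```

Because the container only implements the read-only Sequence protocol, in-place mutation such as `headers.append(...)` is not available, which matches the "immutable sequence" tightening discussed in this PR.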

I've addressed this commit-by-commit, which should help make the approach I've taken here clear...

Headers becomes a Sequence type, rather than a raw mutable list, but continues to store exactly the same information.
Store raw casing information in the Headers type, but don't use it anywhere.
Use title casing anywhere we're using get_comma_header, set_comma_header. Both functions continue to be case insensitive in their effects, but it will matter in the set_comma_header case because it'll give us nice header casing on the over-the-wire bytes. Switching both over within the codebase for consistency.
Switch the writer to use the raw header casing.

Strictly speaking there is an API change here, in that .headers on events are now sequences, rather than plain lists. Any user code that is doing grungy stuff by mutating that data structure in-place wouldn't function after this. But that's a bit of a hacky, broken thing to be doing anyway, so a version bump that tightened up the API spec into ".headers is an immutable sequence" seems reasonable enough, right?

Anyways, putting this out there, so we've got something to discuss. 🤔

Thanks so much for maintaining such a fantastically careful & thoroughly designed library. It's a joy to work with. ✨

codecov · 2020-08-25T12:11:36Z

Codecov Report

Merging #103 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #103      +/-   ##
==========================================
+ Coverage   99.14%   99.15%   +0.01%     
==========================================
  Files          21       21              
  Lines         937      950      +13     
  Branches      173      173              
==========================================
+ Hits          929      942      +13     
  Misses          7        7              
  Partials        1        1

Impacted Files	Coverage Δ
h11/tests/test_connection.py	`100.00% <ø> (ø)`
h11/tests/test_io.py	`100.00% <ø> (ø)`
h11/_connection.py	`100.00% <100.00%> (ø)`
h11/_headers.py	`100.00% <100.00%> (ø)`
h11/_writers.py	`88.57% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1540843...a0eaaf8. Read the comment docs.

sigmavirus24 · 2020-08-25T12:35:07Z

h11/_headers.py

+    def __eq__(self, other):
+        return list(self) == other
+
+    def raw(self):


I'm curious about this. It seems it would return a list of triplets, whereas elsewhere "raw" implies the name as it came over the wire. This should be clearer about what it is returning.

I also wonder about the wisdom of having a headers sequence without structured single-header data

"I also wonder about the wisdom of having a headers sequence without structured single-header data"

I'm not quite sure what you mean here. Are you saying "If you're returning a three-tuple from this interface then let's have it use a named-tuple" or something else?

Perhaps a marginally different interface for us to expose here would not be .raw() -> (<lowercase name>, <raw name>, <value>), but instead expose just .raw_items() -> (<raw name>, <value>)

Perhaps that'd address the naming/intent slightly better?

I'm thinking of having a class that encapsulates a Header and then having the collection of Headers use that. A namedtuple could work fine as well. All that said, I get that tuples may be a smidge faster and that these are internal implementation details. Speaking from having worked on header collection objects in urllib3 in the past, these tuples can drive maintainers to pull out their hair (as well as future folks trying to update/extend the behaviour).

Right, for something that remains compatible with the existing API I think the options here are...

A custom Headers sequence, that exposes the extra information in a .raw_items() interface or similar, that returns a two-tuple of (case-sensitive-name, value) for usages that require the raw casing info.

A custom Headers sequence that returns Header instances, which can iterate as two-tuples, but also expose .name, .case_sensitive_name and .value attributes, which are available for usages that require the raw casing info.

Or some variation on those (e.g. this PR, which currently has .raw() returning the three-tuple of info).

Personally I'm fairly agnostic, as both of the above options seem reasonable enough. The Header option has the most overhead, since it creates and accesses a per-header instance rather than a plain tuple, while I wouldn't expect the .raw_items() approach to introduce anything really noticeable. I could run some timings on each of the options to help inform the decision.


Kriechi · 2020-08-25T15:40:25Z

Just as a comment or maybe for additional inspiration:
This is how we solved a similar problem in the mitmproxy project:
https://github.com/mitmproxy/mitmproxy/blob/master/mitmproxy/net/http/headers.py

tomchristie · 2020-09-08T09:58:52Z

Okay, based on the feedback here's a slight retake on this, which I think exposes a neater interface and has a pretty minimal change footprint.

After this pull request...

h11 preserves header casing on the wire.
event.headers changes marginally in that it exposes a list of Header instances, which are tuple-like (name, value) byte pairs, rather than actual tuples.
The Header items can also be accessed explicitly, with header.name, header.value. For use-cases where you need the original case-sensitive naming, header.raw_name is also available.
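
To make the "tuple-like Header instance" idea concrete, here is a minimal sketch under stated assumptions. The attribute names `name`, `value` and `raw_name` come from the description above; the class body itself is hypothetical and is not the implementation in this PR.

```python
class Header(tuple):
    """Hypothetical tuple-like header that behaves as (lowercased_name, value)."""

    def __new__(cls, raw_name, value):
        # The tuple contents are the lowercased name and the value, so
        # unpacking and equality behave like a plain two-tuple.
        self = super().__new__(cls, (raw_name.lower(), value))
        # The original casing is kept on the side as an extra attribute.
        self.raw_name = raw_name
        return self

    @property
    def name(self):
        return self[0]

    @property
    def value(self):
        return self[1]


header = Header(b"Content-Type", b"text/plain")
name, value = header            # unpacks like a plain two-tuple
print(name, value, header.raw_name)
```

Code that treats each header as a `(name, value)` pair keeps working, while callers that need the on-the-wire casing read `header.raw_name`. The cost is one extra object per header, which is the overhead discussed in the review thread.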

Internally...

set_comma_header no longer "expects name to be lowercase bytes". It keeps the same case-insensitive behaviour, while preserving whatever casing information is passed.
get_comma_header no longer "expects name to be lowercase bytes". It keeps the same case-insensitive behaviour.
Test cases use title casing whenever calling into set_comma_header/get_comma_header.
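
The behaviour described for these helpers can be sketched roughly as follows. This is an illustration only: `get_comma_values` and `set_comma_values` are hypothetical names operating on a plain list of `(raw_name, value)` byte pairs, and they are not h11's actual `get_comma_header`/`set_comma_header`.

```python
def get_comma_values(headers, name):
    # Hypothetical helper: match the header name case-insensitively and
    # split each matching value on commas.
    wanted = name.lower()
    values = []
    for raw_name, value in headers:
        if raw_name.lower() == wanted:
            for part in value.split(b","):
                part = part.strip()
                if part:
                    values.append(part)
    return values


def set_comma_values(headers, name, new_values):
    # Hypothetical helper: drop every header matching `name` regardless of
    # casing, then append the new values using the casing the caller passed.
    wanted = name.lower()
    kept = [(n, v) for n, v in headers if n.lower() != wanted]
    kept.extend((name, v) for v in new_values)
    return kept


headers = [(b"connection", b"keep-alive, Upgrade"), (b"Host", b"example.org")]
print(get_comma_values(headers, b"Connection"))
updated = set_comma_values(headers, b"Connection", [b"close"])
print(updated)
```

Matching stays case-insensitive in both directions, but whatever casing the caller passes to the setter is what ends up stored, which is why title casing at the call sites affects the bytes written to the wire.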

njsmith · 2020-09-08T10:33:38Z

What do you get from PYTHONPATH=. python bench/benchmarks/benchmarks.py before and after this change?

tomchristie · 2020-09-08T10:35:40Z

Ah thanks, I'd just been looking into that, but I hadn't seen the benchmark.

Just as a rough guideline, I've used the following to get an idea of the comparative performance of the different approaches here...

import h11
import timeit


def send_request():
    # Create a fresh client connection and serialise a single GET request.
    conn = h11.Connection(our_role=h11.CLIENT)
    headers = [
        (b'Accept', b'*/*'),
        (b'Accept-Encoding', b'gzip, deflate'),
        (b'Connection', b'keep-alive'),
        (b'Host', b'www.example.org'),
        (b'User-Agent', b'HTTPie/2.2.0'),
    ]
    request = h11.Request(method="GET", target="/", headers=headers)
    # The serialised bytes are unused; only the time taken matters here.
    bytes_to_send = conn.send(request)


# Total seconds taken for 100,000 calls to send_request.
print(timeit.timeit(send_request, number=100000))

Which comes out with...

Plain tuples, case not preserved: ~3.5 seconds
Headers as a sequence, exposing .raw_items interface: ~4.1 seconds
List of Header instances, exposing .raw_name interface: ~4.5 seconds

I'll take a look at the proper benchmarks now...

tomchristie · 2020-09-08T10:38:40Z

Existing...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
7389.5 requests/sec
7457.9 requests/sec
7451.1 requests/sec
7445.4 requests/sec
7434.2 requests/sec
7428.8 requests/sec
7447.1 requests/sec

Headers as a sequence...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
6393.3 requests/sec
6404.6 requests/sec
6369.1 requests/sec
6346.9 requests/sec
6372.4 requests/sec
6388.1 requests/sec
6403.9 requests/sec

Header instances...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
5851.9 requests/sec
5897.9 requests/sec
5894.1 requests/sec
5873.7 requests/sec
5875.9 requests/sec
5871.4 requests/sec
5903.8 requests/sec
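
For a quick comparison of the three runs above, the following sketch averages the reported figures and expresses each as a drop relative to the existing code. The numbers are copied from the benchmark output in this comment; the labels and the choice of a simple mean are assumptions made for the example.

```python
from statistics import mean

# Requests/sec figures copied from the three benchmark runs above.
runs = {
    "existing": [7389.5, 7457.9, 7451.1, 7445.4, 7434.2, 7428.8, 7447.1],
    "headers_sequence": [6393.3, 6404.6, 6369.1, 6346.9, 6372.4, 6388.1, 6403.9],
    "header_instances": [5851.9, 5897.9, 5894.1, 5873.7, 5875.9, 5871.4, 5903.8],
}

baseline = mean(runs["existing"])
summary = {}
for label, samples in runs.items():
    average = mean(samples)
    # Percentage drop in throughput relative to the existing implementation.
    drop = 100.0 * (baseline - average) / baseline
    summary[label] = (average, drop)
    print(f"{label}: {average:.1f} requests/sec ({drop:.1f}% below existing)")
```

On these figures the Headers-as-a-sequence variant comes out roughly 14% below the existing code and the Header-instances variant roughly 21% below.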

njsmith · 2020-09-08T10:39:17Z

The benchmark I cited isn't terribly clever, but it does exercise a full request/response cycle with some realistic headers.

tomchristie · 2020-09-14T13:33:55Z

Closing this off in favour of #104

tomchristie added 5 commits August 25, 2020 12:01

Headers becomes a type of Sequence, rather than a raw, mutable list.

2b94ba4

Write headers from headers.raw(), but continue to use lower-casing

ad75a8c

Use title casing in get_comma_header/set_comma_header usages

3641ffd

Preserve header casing in I/O bytes

5cac0b4

Python 2 support

edf88cb

tomchristie mentioned this pull request Aug 25, 2020

Key headers are lowercased before sent ? encode/httpx#538

Closed

sigmavirus24 reviewed Aug 25, 2020

Fix __getitem__ implementation

0356d32

tomchristie mentioned this pull request Sep 6, 2020

How to work with session adapters as in requests? encode/httpx#1263

Closed

tomchristie added 3 commits September 8, 2020 10:39

Preserve header casing using tuple-like Header instances

054f0ca

Minimize change footprint

8cafe02

Minimize change footprint

a0eaaf8

tomchristie mentioned this pull request Sep 8, 2020

Preserve header casing. Take two. 🎬 #104

Merged

tomchristie closed this Sep 14, 2020

tomchristie deleted the preserve-header-casing branch September 14, 2020 13:33
