Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SSL_MODE_AUTO_RETRY #1333

Merged
merged 15 commits into from
Jan 19, 2024

Conversation

WillChilds-Klein
Copy link
Contributor

@WillChilds-Klein WillChilds-Klein commented Dec 1, 2023

Issues:

Addresses CryptoAlg-2136

Description of changes:

Notes

This change implements SSL_MODE_AUTO_RETRY, which was previously ignored. When this option (off by default) is set, SSL_get_errors behavior hews a little more closely to OpenSSL when processing a return code of 0 (i.e. empty I/O). With this option set, we allow error processing for empty reads if we're in a retryable state. With this option unset, we treat (most) 0-valued return codes as SSL_ERROR_SYSCALL, using it as a sort-of alias for EOF in the underlying transport. Internal calls to SSL_get_error (such as those in bio_ssl.cc) then observe the SSL_ERROR_SYSCALL and fail to process potential retries.

OpenSSL implements this mode differently, only checking its value in its main read loop. Neither OpenSSL 1.1.1 nor OpenSSL 3.x return SSL_ERROR_SYSCALL early for empty reads, instead allowing them to be processed for retryable state. BoringSSL appears to have diverged from this behavior in 2015, but that commit was just a revert of a regression committed ~1 month prior. Their previous implementation closely resembles that of OpenSSL 1.0.2, guarding retryable error processing with i < 0 checks. SSL_get_error in OpenSSL 1.1.1 and OpenSSL 3.0, however, have no such guards. This change resembles the latter two.

Why is it important that we process 0-valued return codes more fully? The recommended SSL IO functions (SSL_read, SSL_write, etc.) are, by contract, allowed to return negative numbers to indicate an error or non-success condition. However, their size_t-clean counterparts (SSL_read_ex, SSL_write_ex, etc.) must, by contract, only return 0 or 1. This means that passing the return code from the various _ex functions to SSL_get_error loses error reporting information, resulting in inconsistent handling of otherwise identical error cases between the two function families.

Another solution to the _ex functions' loss of error reporting information would be to change the contract of those functions to allow negative return values, directly returning the return code from internal calls to SSL_read or SSL_write. I decided not to take this approach as it would break callers who do not account for negative return values in their error handling.

Relevant Links

Call-outs:

  • n/a

Testing:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

@codecov-commenter
Copy link

codecov-commenter commented Dec 1, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (d9caaca) 76.85% compared to head (85c669f) 76.84%.

Files Patch % Lines
ssl/ssl_lib.cc 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1333      +/-   ##
==========================================
- Coverage   76.85%   76.84%   -0.02%     
==========================================
  Files         425      425              
  Lines       71498    71502       +4     
==========================================
- Hits        54948    54943       -5     
- Misses      16550    16559       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@WillChilds-Klein WillChilds-Klein force-pushed the libssl-handle-EAGAIN branch 2 times, most recently from 9607aca to 06f9cc4 Compare December 6, 2023 20:43
@WillChilds-Klein WillChilds-Klein changed the title Handle EAGAIN in SSL_get_error Don't return SSL_ERROR_SYSCALL for potential EOF Dec 6, 2023
@WillChilds-Klein WillChilds-Klein force-pushed the libssl-handle-EAGAIN branch 5 times, most recently from 82eae99 to 6065eed Compare December 13, 2023 17:26
@samuel40791765
Copy link
Contributor

samuel40791765 commented Dec 14, 2023

This seemed relevant to the test failures I'm running into with Bind9, so I tried applying the change to see if it would resolve my issues. It results in the tests from Bind9 hanging forever though (the tests used to directly fail) :(.
I'm pasting this code block from bind9 here, it seems they were dependent on detecting the original behaviour:

@WillChilds-Klein WillChilds-Klein force-pushed the libssl-handle-EAGAIN branch 4 times, most recently from 0ddfd60 to 6dd64f7 Compare December 20, 2023 18:24
@WillChilds-Klein WillChilds-Klein force-pushed the libssl-handle-EAGAIN branch 3 times, most recently from e335d43 to 3531382 Compare January 4, 2024 15:44
@WillChilds-Klein WillChilds-Klein marked this pull request as ready for review January 4, 2024 16:29
@WillChilds-Klein WillChilds-Klein requested a review from a team as a code owner January 4, 2024 16:29
@WillChilds-Klein WillChilds-Klein enabled auto-merge (squash) January 4, 2024 16:30
@justsmth justsmth self-requested a review January 4, 2024 16:34
@WillChilds-Klein WillChilds-Klein changed the title Don't return SSL_ERROR_SYSCALL for potential EOF Guard EOF "error" return in SSL_get_error Jan 5, 2024
ssl/ssl_test.cc Outdated Show resolved Hide resolved
ssl/ssl_test.cc Outdated Show resolved Hide resolved
ret = SSL_read(client.get(), buf, sizeof(buf));
EXPECT_EQ(ret, (int) sizeof(buf));
EXPECT_EQ(SSL_get_error(client.get(), ret), SSL_ERROR_NONE);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you tried to read another byte here what do you get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we get SSL_ERROR_WANT_READ from SSL_get_error, but this time with a ret val of -1 (as opposed to 0 retval on ssl_test.cc L10502). SSL_ERROR_WANT_READ is retryable, that doesn't seem right....

moving PR back to draft pending further investigation.

Copy link
Contributor Author

@WillChilds-Klein WillChilds-Klein Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I confirmed that even before this change, trying to read additional bytes after exhausting the rbio resulted in SSL_ERROR_WANT_READ. so, it looks like this change doesn't affect that specific behavior.

@WillChilds-Klein WillChilds-Klein marked this pull request as draft January 10, 2024 00:40
@WillChilds-Klein WillChilds-Klein added the automation-exempt This issue or pull request is exempt from automatic closure. label Jan 11, 2024
justsmth
justsmth previously approved these changes Jan 16, 2024
ssl/ssl_lib.cc Outdated Show resolved Hide resolved
@WillChilds-Klein
Copy link
Contributor Author

This seemed relevant to the test failures I'm running into with Bind9, so I tried applying the change to see if it would resolve my issues. It results in the tests from Bind9 hanging forever though (the tests used to directly fail) :(. I'm pasting this code block from bind9 here, it seems they were dependent on detecting the original behaviour:

* https://gitlab.isc.org/isc-projects/bind9/-/blob/4330014fce5ab2ac1a6e58cd330762318a34f8d9/lib/isc/netmgr/tlsstream.c#L731-750

@samuel40791765 -- from what I can tell, bind9 does not set SSL_MODE_AUTO_RETRY, so this change should not affect that integration.

@WillChilds-Klein WillChilds-Klein enabled auto-merge (squash) January 18, 2024 22:27
@WillChilds-Klein WillChilds-Klein removed the automation-exempt This issue or pull request is exempt from automatic closure. label Jan 18, 2024
@WillChilds-Klein WillChilds-Klein merged commit 208327e into aws:main Jan 19, 2024
35 of 36 checks passed
WillChilds-Klein added a commit to WillChilds-Klein/aws-lc that referenced this pull request Jan 19, 2024
This change implements `SSL_MODE_AUTO_RETRY`, which was previously ignored. When this option (off by default) is set, `SSL_get_error`s behavior hews a little more closely to OpenSSL when processing a return code of 0 (i.e. empty I/O). With this option set, we allow error processing for empty reads if we're in a retryable state. With this option unset, we treat (most) 0-valued return codes as `SSL_ERROR_SYSCALL`, using it as a sort-of alias for EOF in the underlying transport. Internal calls to `SSL_get_error` (such as [those in `bio_ssl.cc`](https://github.com/aws/aws-lc/blob/main/ssl/bio_ssl.cc#L29)) then observe the  `SSL_ERROR_SYSCALL` and fail to process potential retries.

OpenSSL implements this mode differently, only checking its value in its [main read loop](https://github.com/openssl/openssl/blob/398011848468c7e8e481b295f7904afc30934217/ssl/record/rec_layer_s3.c#L992). Neither OpenSSL [1.1.1][6] nor [OpenSSL 3.x][10] return `SSL_ERROR_SYSCALL` early for empty reads, instead allowing them to be processed for retryable state. BoringSSL [appears to have diverged][7] from this behavior in 2015, but that commit was just a revert of a regression committed ~1 month prior. Their [previous implementation][18] closely resembles [that of OpenSSL 1.0.2][13], guarding retryable error processing with `i < 0` checks. `SSL_get_error` in [OpenSSL 1.1.1][12] and [OpenSSL 3.0][11], however, have no such guards. This change resembles the latter two.

Why is it important that we process 0-valued return codes more fully? The recommended SSL IO functions (`SSL_read`, `SSL_write`, etc.) are, by contract, allowed to return negative numbers to indicate an error or non-success condition. However, their `size_t`-clean counterparts (`SSL_read_ex`, `SSL_write_ex`, etc.) must, by contract, only return 0 or 1. This means that passing the return code from the various `_ex` functions to `SSL_get_error` loses error reporting information, resulting in inconsistent handling of otherwise identical error cases between the two function families.

Another solution to the `_ex` functions' loss of error reporting information would be to change the contract of those functions to allow negative return values, directly returning the return code from internal calls to `SSL_read` or `SSL_write`. I decided not to take this approach as it would break callers who do not account for negative return values in their error handling.

- [OpenSSL 1.0.2 documentation for `SSL_get_error`][3]
- [OpenSSL 1.1.1 documentation for `SSL_get_error`][4]
- [OpenSSL 3.0 documentation for `SSL_get_error`][5]
- [OpenSSL 1.0.2 implementation for `SSL_get_error`][13]
- [OpenSSL 1.1.1 implementation for `SSL_get_error`][12]
- [OpenSSL 3.0 implementation for `SSL_get_error`][11]
- [OpenSSL master (3.2) implementation for `SSL_get_error`][10]
- [OpenSSL 1.1.1 commit dropping `ret < 0` guards for hint processing][6]
- [Unresolved OpenSSL issue for distinguishing EOF from failure in `BIO_read_ex`][1]
- [BoringSSL commit introducing `SSL_ERROR_SYSCALL` for `ret == 0` case][7]
- [BoringSSL Issue 503: Stop preserving arbitrary BIO error return values][2]
- [BoringSSL Issue 24: Sanitize SSL and BIO return codes.][15]
- [OpenSSL 3.0 commit returning `SSL_ERROR_SSL`, add reason to err stack on EOF][14]
- [OpenSSL 3.0 commit introducing `SSL_OP_IGNORE_UNEXPECTED_EOF`  option][8]
- [CPython commit from @davidben removing CPython's use of `SSL_OP_IGNORE_UNEXPECTED_EOF`][9]
- [CPython commit switching from `SSL_read` and `SSL_write` to `SSL_read_ex` and `SSL_write_ex`][16]
- [OpenSSL 1.1.1 documentation for `SSL_MODE_AUTO_RETRY`][17]

[1]: openssl/openssl#8208
[2]: https://bugs.chromium.org/p/boringssl/issues/detail?id=503
[3]: https://www.openssl.org/docs/man1.0.2/man3/SSL_get_error.html
[4]: https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html
[5]: https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html
[6]: openssl/openssl@8051ab2
[7]: google/boringssl@9a38e92
[8]: openssl/openssl@09b90e0
[9]: python/cpython#95495
[10]: https://github.com/openssl/openssl/blob/master/ssl/ssl_lib.c#L4601
[11]: https://github.com/openssl/openssl/blob/openssl-3.0/ssl/ssl_lib.c#L3839
[12]: https://github.com/openssl/openssl/blob/OpenSSL_1_1_1-stable/ssl/ssl_lib.c#L3617
[13]: https://github.com/openssl/openssl/blob/OpenSSL_1_0_2-stable/ssl/ssl_lib.c#L2709
[14]: openssl/openssl@d924dbf
[15]: https://bugs.chromium.org/p/boringssl/issues/detail?id=24
[16]: python/cpython@89d1550
[17]: https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_mode.html
[18]: google/boringssl@fcf2583#diff-38e1eecb6013c2fe8bf194e5020ce1afd6e31c371d7849e041464ad822fe745bL2271
dougch pushed a commit to dougch/aws-lc that referenced this pull request Jan 30, 2024
# Notes

This change implements `SSL_MODE_AUTO_RETRY`, which was previously ignored. When this option (off by default) is set, `SSL_get_error`s behavior hews a little more closely to OpenSSL when processing a return code of 0 (i.e. empty I/O). With this option set, we allow error processing for empty reads if we're in a retryable state. With this option unset, we treat (most) 0-valued return codes as `SSL_ERROR_SYSCALL`, using it as a sort-of alias for EOF in the underlying transport. Internal calls to `SSL_get_error` (such as [those in `bio_ssl.cc`](https://github.com/aws/aws-lc/blob/main/ssl/bio_ssl.cc#L29)) then observe the  `SSL_ERROR_SYSCALL` and fail to process potential retries.

OpenSSL implements this mode differently, only checking its value in its [main read loop](https://github.com/openssl/openssl/blob/398011848468c7e8e481b295f7904afc30934217/ssl/record/rec_layer_s3.c#L992). Neither OpenSSL [1.1.1][6] nor [OpenSSL 3.x][10] return `SSL_ERROR_SYSCALL` early for empty reads, instead allowing them to be processed for retryable state. BoringSSL [appears to have diverged][7] from this behavior in 2015, but that commit was just a revert of a regression committed ~1 month prior. Their [previous implementation][18] closely resembles [that of OpenSSL 1.0.2][13], guarding retryable error processing with `i < 0` checks. `SSL_get_error` in [OpenSSL 1.1.1][12] and [OpenSSL 3.0][11], however, have no such guards. This change resembles the latter two. 

Why is it important that we process 0-valued return codes more fully? The recommended SSL IO functions (`SSL_read`, `SSL_write`, etc.) are, by contract, allowed to return negative numbers to indicate an error or non-success condition. However, their `size_t`-clean counterparts (`SSL_read_ex`, `SSL_write_ex`, etc.) must, by contract, only return 0 or 1. This means that passing the return code from the various `_ex` functions to `SSL_get_error` loses error reporting information, resulting in inconsistent handling of otherwise identical error cases between the two function families.

Another solution to the `_ex` functions' loss of error reporting information would be to change the contract of those functions to allow negative return values, directly returning the return code from internal calls to `SSL_read` or `SSL_write`. I decided not to take this approach as it would break callers who do not account for negative return values in their error handling.

# Relevant Links

- [OpenSSL 1.0.2 documentation for `SSL_get_error`][3]
- [OpenSSL 1.1.1 documentation for `SSL_get_error`][4]
- [OpenSSL 3.0 documentation for `SSL_get_error`][5]
- [OpenSSL 1.0.2 implementation for `SSL_get_error`][13]
- [OpenSSL 1.1.1 implementation for `SSL_get_error`][12]
- [OpenSSL 3.0 implementation for `SSL_get_error`][11]
- [OpenSSL master (3.2) implementation for `SSL_get_error`][10]
- [OpenSSL 1.1.1 commit dropping `ret < 0` guards for hint processing][6]
- [Unresolved OpenSSL issue for distinguishing EOF from failure in `BIO_read_ex`][1]
- [BoringSSL commit introducing `SSL_ERROR_SYSCALL` for `ret == 0` case][7]
- [BoringSSL Issue 503: Stop preserving arbitrary BIO error return values][2]
- [BoringSSL Issue 24: Sanitize SSL and BIO return codes.][15]
- [OpenSSL 3.0 commit returning `SSL_ERROR_SSL`, add reason to err stack on EOF][14]
- [OpenSSL 3.0 commit introducing `SSL_OP_IGNORE_UNEXPECTED_EOF`  option][8]
- [CPython commit from @davidben removing CPython's use of `SSL_OP_IGNORE_UNEXPECTED_EOF`][9]
- [CPython commit switching from `SSL_read` and `SSL_write` to `SSL_read_ex` and `SSL_write_ex`][16]
- [OpenSSL 1.1.1 documentation for `SSL_MODE_AUTO_RETRY`][17]



[1]: openssl/openssl#8208
[2]: https://bugs.chromium.org/p/boringssl/issues/detail?id=503
[3]: https://www.openssl.org/docs/man1.0.2/man3/SSL_get_error.html
[4]: https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html
[5]: https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html
[6]: openssl/openssl@8051ab2
[7]: google/boringssl@9a38e92
[8]: openssl/openssl@09b90e0
[9]: python/cpython#95495
[10]: https://github.com/openssl/openssl/blob/master/ssl/ssl_lib.c#L4601
[11]: https://github.com/openssl/openssl/blob/openssl-3.0/ssl/ssl_lib.c#L3839
[12]: https://github.com/openssl/openssl/blob/OpenSSL_1_1_1-stable/ssl/ssl_lib.c#L3617
[13]: https://github.com/openssl/openssl/blob/OpenSSL_1_0_2-stable/ssl/ssl_lib.c#L2709
[14]: openssl/openssl@d924dbf
[15]: https://bugs.chromium.org/p/boringssl/issues/detail?id=24
[16]: python/cpython@89d1550
[17]: https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_mode.html
[18]: google/boringssl@fcf2583#diff-38e1eecb6013c2fe8bf194e5020ce1afd6e31c371d7849e041464ad822fe745bL2271
samuel40791765 added a commit that referenced this pull request Feb 16, 2024
1. This adds an integration CI dimension for Bind9

2. Resolved "cmocka unit tests" for Bind9

	* Additional <openssl/asn1.h> import in <openssl/objects.h>: Bind
	depends on some ASN1 functions, but does not directly import the
	corresponding header. OpenSSL imports the asn1 header file in
	objects.h (which Bind is pulling these symbols from), so I've
	added the header file reference to objects.h.

	* SSL_get_error error anticipation fixing: There were several
	failures discovered to be related this, thanks to research done
	in Implement SSL_MODE_AUTO_RETRY #1333. The issue was pinned down
	the check implemented in google/boringssl@9a38e92. This check used
	to exist before the final return of SSL_get_error in OpenSSL.
	Upstream moved this earlier in the function with
	google/boringssl@fcf2583. However, much of the functions guards for
	i < 0 checks have been removed since OpenSSL 1.1.1, so the early
	logic no longer applies.
	This check has evolved into SSL_ERROR_ZERO_RETURN in our code.
	Moving the check further down helps us gain better parity with
	OpenSSL 1.1.1. Doing so passes the bind test failures for
	proxystream_test, tls_test, and doh_test. This also happens to help
	our integration with CPython, so I've reconfigured that patch.
	We actually already use SSL_AUTO_RETRY by default in AWS-LC. The
	recent change mentioned in the point above surrounding the flag
	(208327e) was just to make some of the errors consistent in CPython
	when the flag was used. I've reverted the special behavior
	surrounding it since it should no longer be needed.

	* Assertion for SSL_set_shutdown: The assertion was added in
	63006a9, where it’s stated that we didn’t want SSL_set_shutdown
	messing up the state machine. This assertion is causing failures
	in tlsdns_test for Bind9, so it appears that we'll have to remove
	this to gain better OpenSSL parity.

3. Patch file needed for Bind seems to be slight bug in their build
configuration. This was from a fairly recent commit. We can look to
contribute this sometime soon.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants