This repository has been archived by the owner on May 18, 2021. It is now read-only.

Intermittent SAML and 2FA Push Notification Timeouts from Okta #298

Closed
sdann opened this issue Sep 28, 2020 · 14 comments

Comments

@sdann
Contributor

sdann commented Sep 28, 2020

Posting to see if any other users are having similar issues. Since 9/24/20, several of our users have been getting HTTP timeouts waiting for responses from Okta. The behavior is inconsistent, but it takes three forms:

  1. Initial SAML Authn never responds. No MFA notification.

    getting creds via SAML: Failed to authenticate with okta. If your credentials have changed, use 'aws-okta add': &url.Error{Op:"Post", URL:"https://convoy.okta.com/api/v1/authn", Err:(*http.httpError)(0xc0002fc0a0)}

  2. MFA prompt, but Push notification never shows up on mobile device. Eventual error:

    getting creds via SAML: Post "https://[company].okta.com/api/v1/authn/factors/[id]/verify?rememberDevice=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  3. MFA notification arrives on the phone and I complete it, but aws-okta doesn't seem to register it and it fails with a timeout.

    getting creds via SAML: Failed authn verification for okta. Err: Post "https://[company].okta.com/api/v1/authn/factors/[id]/verify": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Right now, my assumption is that this is a network-level issue on Okta's servers, but it's only affecting aws-okta. Okta GUI authentication works fine.
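
For context, those errors are Go's HTTP client deadline firing before Okta responds with headers. A minimal, hedged way to reproduce the first call outside aws-okta is sketched below; the endpoint and JSON payload shape follow Okta's primary-authentication API, while the org domain, credentials, and 30-second timeout are placeholders, not aws-okta's actual values:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Placeholder org and credentials; the payload shape follows Okta's
	// primary-authentication API (POST /api/v1/authn).
	payload := `{"username": "user@example.com", "password": "hunter2"}`

	// A client-side deadline like this is what surfaces as
	// "context deadline exceeded (Client.Timeout exceeded while awaiting headers)".
	client := &http.Client{Timeout: 30 * time.Second}

	resp, err := client.Post(
		"https://example.okta.com/api/v1/authn",
		"application/json",
		strings.NewReader(payload),
	)
	if err != nil {
		fmt.Println("authn failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("authn status:", resp.Status)
}
```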

@nickatsegment
Contributor

Hm, haven't heard anything from our users at Segment.

It might be that the undocumented API we're using has changed in a subtle way, like in #293, but somehow we're not hitting the same codepaths. Your only recourse is to compare HTTP flows between the two and reverse engineer it :(

Could also be network gremlins, as you suggest.

@brandt
Contributor

brandt commented Sep 29, 2020

Several folks here have been running into this problem intermittently.

#299 does appear to fix it.

@sdann
Contributor Author

sdann commented Sep 30, 2020

I can confirm that HTTP/2 over TLS is present in all of our failure cases, while HTTP/1.1 over TLS is negotiated in all of my successful repros.

In version 1.0.4, the aws-okta HTTP client advertises that it can speak h2 or http/1.1 in the ClientHello message.

[Screenshot: ClientHello showing both h2 and http/1.1 offered as ALPN protocols]

Even though the Get implementation explicitly requests HTTP/1.1.

Proto: "HTTP/1.1",

The Okta server's ServerHello message picks h2.

The fix in #299, which explicitly sets h1 in the TLS configuration, matches the current code's intent, in addition to fixing the issue.
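
A minimal sketch of that approach (not the exact #299 diff): restrict ALPN to http/1.1 in the transport's TLS config, since setting Request.Proto on an outgoing request doesn't influence protocol negotiation; ALPN in the TLS handshake does.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newHTTP1Client returns an http.Client whose TLS handshake only advertises
// http/1.1 via ALPN, so the server can't pick h2. A sketch of the approach,
// not the exact change in #299.
func newHTTP1Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				NextProtos: []string{"http/1.1"}, // omit "h2" from the ClientHello
			},
			// With a custom TLSClientConfig and ForceAttemptHTTP2 left false,
			// the transport won't configure HTTP/2 on its own.
		},
	}
}

func main() {
	resp, err := newHTTP1Client().Get("https://example.okta.com/") // placeholder URL
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Proto) // expected: HTTP/1.1
}
```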

Chippiewill pushed a commit to Chippiewill/aws-okta that referenced this issue Sep 30, 2020
Previously the content length was being calculated based off of an empty
uninitialised byte array. With this change it's now calculated off of the
actual data array used as the body.

Setting the content length to 0 seemed to be causing an issue with
recent changes to Okta's infrastructure as noticed in segmentio#298.
@Chippiewill
Contributor

I started noticing this problem yesterday, and after poking some debug prints into aws-okta, I noticed that it was always sending a Content-Length of 0. This turns out to be a bug in the way it's calculated: it's currently based on the length of an object that's never initialized.

https://github.com/segmentio/aws-okta/blob/master/lib/okta.go#L599

PR raised: #300
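
To make the shape of the bug concrete, here's a hypothetical sketch (illustrative names, not the actual okta.go code): the Content-Length was derived from a byte slice that was never populated, so it was always 0; the fix conceptually derives it from the bytes actually used as the body.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buggyRequest mirrors the bug shape: Content-Length is taken from a slice
// that is never filled in, so every request is sent with Content-Length: 0.
func buggyRequest(url string, payload []byte) *http.Request {
	var data []byte // declared but never initialized
	req, _ := http.NewRequest("POST", url, bytes.NewReader(payload))
	req.ContentLength = int64(len(data)) // bug: len(data) is always 0
	return req
}

// fixedRequest derives Content-Length from the bytes actually sent as the
// body, which is what #300 does conceptually.
func fixedRequest(url string, payload []byte) *http.Request {
	req, _ := http.NewRequest("POST", url, bytes.NewReader(payload))
	req.ContentLength = int64(len(payload))
	return req
}

func main() {
	payload := []byte(`{"stateToken": "placeholder"}`)
	fmt.Println(buggyRequest("https://example.okta.com/api/v1/authn", payload).ContentLength) // 0
	fmt.Println(fixedRequest("https://example.okta.com/api/v1/authn", payload).ContentLength) // len(payload)
}
```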

@ngodec

ngodec commented Sep 30, 2020

Noticing this a lot today, intermittently across different AWS profiles and Okta instances.
Present in aws-okta versions: 0.26.3, 1.0.1, 1.0.4

@nickatsegment
Contributor

How consistent is this for folks? I.e., what percentage of auth attempts time out? I haven't been able to repro personally, but a handful of users at Segment have said they are affected.

@Chippiewill
Contributor

For me personally, it's been super consistent for an hour (100%) and then stopped entirely for hours, although there's been a bit of inconsistency in which part of the flow times out (first request vs. MFA push, etc.). Similar reports from many of my colleagues (but equally, some colleagues are entirely unaffected). On every occasion it's broken, I've had saml2aws work entirely fine (which uses the same underlying API), and I've even manually stepped through the sign-in flow fine with curl.

@nickatsegment
Contributor

nickatsegment commented Sep 30, 2020

Having some trouble with our CI publishing pipeline, but the tag v1.0.5 is there. That's enough for Homebrew. If somebody else could submit a PR there, that'd be great (we don't use it).

I'm working on getting the binaries and packages published to our GH Releases.

Update: it's just the Linux binary that failed to publish. The RPMs and DEBs are live on packagecloud.

@nickatsegment
Contributor

Big shoutout to @Chippiewill and @mvallaly-rally for their PRs.

@Ngibb

Ngibb commented Sep 30, 2020

Thanks for getting this fixed so fast.

@richadams

I raised Homebrew/homebrew-core#61790 to bump the version in Homebrew.

@sdann
Contributor Author

sdann commented Sep 30, 2020

Amazing community effort to get this bug root-caused and fixed quickly. @nickatsegment, even with partial deprecation, you've got a large and caring userbase.

@seanorama

Great work! Any progress on getting the assets pushed into the 1.0.5 GitHub release?

@nickatsegment
Contributor

@seanorama See #301 (comment). TLDR: no.

arohter pushed a commit to TiVo/aws-okta that referenced this issue Nov 21, 2020
arohter added a commit to TiVo/aws-okta that referenced this issue Nov 21, 2020
* Calculate OktaClient Content-Length correctly (segmentio#300)

Fixes: segmentio#298

* Update issue templates

* Fix cred process expiration (segmentio#303)

* Added Ubuntu 2020 (Focal) to Makefile.release (segmentio#304)

* disable github releases (currently broken) (segmentio#305)

Co-authored-by: Will Gardner <willg@rdner.io>
Co-authored-by: Nick Irvine <nick@segment.com>
Co-authored-by: Zoltán Reegn <zoltan.reegn@gmail.com>
Co-authored-by: Yossi Eliaz <zozo123@users.noreply.github.com>
arohter added a commit to TiVo/aws-okta that referenced this issue Feb 19, 2021
* Calculate OktaClient Content-Length correctly (segmentio#300)

Fixes: segmentio#298

* Update issue templates

* Fix cred process expiration (segmentio#303)

* Added Ubuntu 2020 (Focal) to Makefile.release (segmentio#304)

* disable github releases (currently broken) (segmentio#305)

* Update AWS Go SDK To v1.25.35 (segmentio#307)

Fixes STS regional endpoint support.

* Add STS Regional Endpoint Support To Other STS Clients (segmentio#308)

* Update keyring to v1.1.6 (segmentio#309)

Recent versions of kwallet have removed the old support for the kde4
compatible kwallet dbus interface. This means newer kde5 based
OS installs (e.g. kubuntu 20.04) can no longer use the kwallet backend
with aws-okta.

This was fixed upstream in the keyring lib back in 2019 but the
dependency hasn't been bumped since then.

Co-authored-by: Will Gardner <willg@rdner.io>
Co-authored-by: Nick Irvine <nick@segment.com>
Co-authored-by: Zoltán Reegn <zoltan.reegn@gmail.com>
Co-authored-by: Yossi Eliaz <zozo123@users.noreply.github.com>
Co-authored-by: Andrew Babichev <andrew.babichev@gmail.com>