Feature request: Tracer should log warning instead of throwing when segment is not found #1356

danrivett opened this issue Mar 7, 2023 · 15 comments · Fixed by #1370


danrivett commented Mar 7, 2023

Expected Behaviour

Today, for an unknown reason in one of our Lambdas in Production, the Tracer object was unable to retrieve the current Segment from X-Ray.

Because the middy middleware attempts to retrieve the Segment at the beginning of every request (here), every Lambda request failed until we forced a cold start of the Lambda, which resolved our problem.

It caused the request to fail because the Tracer throws an error if the Segment returned is undefined.

Unfortunately our Lambda was kept warm for over an hour before we tracked it down and forced a cold-start. In the meantime every request failed (see execution logs below).

My expected behaviour would be that if a Segment cannot be retrieved, it doesn't cause the whole Lambda invocation to fail, which it currently does when using the middy middleware.

Tracing should be done on a best-effort basis: if there is any problem tracing a request, it should not cause the main request itself to fail. This particular scenario of getting the Segment is one specific case that could be improved to avoid failing the request, but it's quite possible there are other code paths that are vulnerable to causing the main request to fail, particularly when instrumented by the captureLambdaHandler middy middleware.

Current Behaviour

As mentioned above, the Tracer currently throws an error if the Segment returned is undefined. This causes the middy middleware to fail every request, since getSegment() is called in the captureLambdaHandlerBefore function (here).

Code snippet

You could wire in a dummy Tracing provider to return undefined from its getSegment() method, called here.
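For example, a rough sketch along these lines (stubbing the provider with Jest so getSegment() returns undefined; the imports and Jest usage here are my assumption, not code we actually run):

import middy from '@middy/core';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';

// Simulate the X-Ray SDK failing to resolve a segment
const tracer = new Tracer();
jest.spyOn(tracer.provider, 'getSegment').mockImplementation(() => undefined);

// Any invocation of this handler now rejects with
// "Failed to get the current sub/segment from the context."
export const handler = middy(async () => ({ foo: 'bar' }))
  .use(captureLambdaHandler(tracer));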

Steps to Reproduce

We don't know yet what caused aws-xray-sdk-core's getSegment() (here) to return undefined, but if you can simulate that, and use the captureLambdaHandler middy middleware, you will find that it causes the wrapped request to fail.

For us it caused requests to fail until the function was cold-started, since getSegment() never started returning a Segment object again.

Possible Solution

My suggestion would be to consider updating the Tracer.getSegment() method to log a warning and return a dummy SubSegment if a real segment couldn't be retrieved, similar to how it already does if tracing is disabled.

Something like the following:

  public getSegment(): Segment | Subsegment {
    if (!this.isTracingEnabled()) {
      return new Subsegment('## Dummy segment');
    }

    const segment = this.provider.getSegment();

    if (segment == null) { // changed from === undefined to be slightly more robust in case null is returned somehow
      // throw new Error('Failed to get the current sub/segment from the context.');
      console.warn('Failed to get the current sub/segment from the context, so failing over to manually create one to allow the request to proceed.');
      return new Subsegment('## Fallback segment (as unable to retrieve existing Segment)');
    }
 
    return segment;
  }

AWS Lambda Powertools for TypeScript version

1.6.0

AWS Lambda function runtime

18.x

Packaging format used

Lambda Layers

Execution logs

2023-03-07T17:08:23.043Z	xxxxxxxxx-xxxx-xxxx-xxxxxxxxxxxxx	ERROR	Error: Failed to get the current sub/segment from the context.
    at Object.contextMissingLogError [as contextMissing] (/var/task/index.js:15505:23)
    at getSegment (/var/task/index.js:15569:49)
    at ProviderService.getSegment (/var/task/index.js:20115:51)
    at Tracer2.getSegment (/var/task/index.js:20539:39)
    at open (/var/task/index.js:20841:32)
    at captureLambdaHandlerBefore (/var/task/index.js:20852:11)
    at runMiddlewares (/var/task/index.js:24918:23)
    at async runRequest (/var/task/index.js:24876:5)

2023-03-07T17:08:23.043Z	xxxxxxxxx-xxxx-xxxx-xxxxxxxxxxxxx	ERROR	Error: Failed to get the current sub/segment from the context.
    at Object.contextMissingLogError [as contextMissing] (/var/task/index.js:15505:23)
    at getSegment (/var/task/index.js:15569:49)
    at ProviderService.getSegment (/var/task/index.js:20115:51)
    at Tracer2.getSegment (/var/task/index.js:20539:39)
    at Tracer2.addErrorAsMetadata (/var/task/index.js:20209:33)
    at captureLambdaHandlerError (/var/task/index.js:20867:18)
    at runMiddlewares (/var/task/index.js:24918:23)
    at runRequest (/var/task/index.js:24902:13)

2023-03-07T17:08:23.044Z	xxxxxxxxx-xxxx-xxxx-xxxxxxxxxxxxx	ERROR	Invoke Error 	{
    "errorType": "Error",
    "errorMessage": "Failed to get the current sub/segment from the context.",
    "originalError": {
        "errorType": "Error",
        "errorMessage": "Failed to get the current sub/segment from the context.",
        "stack": [
            "Error: Failed to get the current sub/segment from the context.",
            "    at Tracer2.getSegment (/var/task/index.js:20541:17)",
            "    at open (/var/task/index.js:20841:32)",
            "    at captureLambdaHandlerBefore (/var/task/index.js:20852:11)",
            "    at runMiddlewares (/var/task/index.js:24918:23)",
            "    at async runRequest (/var/task/index.js:24876:5)"
        ]
    },
    "stack": [
        "Error: Failed to get the current sub/segment from the context.",
        "    at Tracer2.getSegment (/var/task/index.js:20541:17)",
        "    at Tracer2.addErrorAsMetadata (/var/task/index.js:20209:33)",
        "    at captureLambdaHandlerError (/var/task/index.js:20867:18)",
        "    at runMiddlewares (/var/task/index.js:24918:23)",
        "    at runRequest (/var/task/index.js:24902:13)"
    ]
}
@danrivett danrivett added triage This item has not been triaged by a maintainer, please wait bug Something isn't working labels Mar 7, 2023
@dreamorosi dreamorosi self-assigned this Mar 7, 2023
dreamorosi (Contributor) commented

Hi @danrivett thank you for reporting this, I'll review the report tomorrow morning (EMEA time zone) and write here my findings.

If needed, after finding and applying a fix, I'm open to doing an emergency release to avoid further outages.


danrivett commented Mar 7, 2023

Thanks for such a quick response @dreamorosi especially given your timezone.

For now we've deployed the following workaround to wrap Tracer.getSegment(), based on my suggested fix in this ticket, so that should avoid the need for an emergency fix and allow more time to evaluate the best solution.

export class CustomTracer extends Tracer {
  public constructor(options?: TracerOptions) {
    super(options);

    this.getSegment = this.getSegment.bind(this);
    this.createFallbackSubSegment = this.createFallbackSubSegment.bind(this);
  }

  getSegment(): Segment | Subsegment {
    try {
      return super.getSegment();
    } catch (error) {
      console.error(
        'Failed to get the current sub/segment from the context, so failing over to manually create one to allow the request to proceed.',
        error
      );
      return this.createFallbackSubSegment();
    }
  }

  protected createFallbackSubSegment(): Subsegment {
    return new Subsegment(
      '## Fallback segment (as unable to retrieve existing Segment)'
    );
  }
}

Note: This external fix is slightly different from the one suggested above, because I'm wrapping the whole Tracer.getSegment() method call, which currently throws an error, rather than altering how that method responds to undefined being returned from this.provider.getSegment().
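For completeness, here's roughly how we wire the wrapper into the middleware stack (a sketch only: the import paths are from the v1.x docs and the serviceName is illustrative, not our real value):

import middy from '@middy/core';
import { captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import { CustomTracer } from './custom-tracer'; // the wrapper above

// CustomTracer extends Tracer, so captureLambdaHandler accepts it unchanged
const tracer = new CustomTracer({ serviceName: 'example-service' });

export const handler = middy(async () => ({ foo: 'bar' }))
  .use(captureLambdaHandler(tracer));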

dreamorosi (Contributor) commented

So, I've started looking into this and, on the Powertools side, I was able to reproduce the behavior with this unit test case:

test('when no subsegment is returned, everything blows up', async () => {

  // Prepare
  const tracer: Tracer = new Tracer();
  jest.spyOn(tracer.provider, 'getSegment')
    .mockImplementationOnce(() => undefined);
  const lambdaHandler: Handler = async (_event: unknown, _context: Context) => ({
    foo: 'bar'
  });
  const handler = middy(lambdaHandler).use(captureLambdaHandler(tracer));

  // Act & Assess
  await expect(handler({}, context)).rejects.toThrowError(
    Error('Failed to get the current sub/segment from the context.')
  );

});

Now, looking at the logs in the OP and at the call stack in them, it seems that the first getSegment call that failed was the one in the before lifecycle stage of the middleware (here), as opposed to the one linked in the OP, which is in the after phase (see below, line annotated with /// HERE ///):

2023-03-07T17:08:23.043Z	xxxxxxxxx-xxxx-xxxx-xxxxxxxxxxxxx	ERROR	Error: Failed to get the current sub/segment from the context.
    at Object.contextMissingLogError [as contextMissing] (/var/task/index.js:15505:23)
    at getSegment (/var/task/index.js:15569:49)
    at ProviderService.getSegment (/var/task/index.js:20115:51)
    at Tracer2.getSegment (/var/task/index.js:20539:39)
    at open (/var/task/index.js:20841:32)
    at captureLambdaHandlerBefore (/var/task/index.js:20852:11) <--- /// HERE ///
    at runMiddlewares (/var/task/index.js:24918:23)
    at async runRequest (/var/task/index.js:24876:5)

This points to the fact that for some reason the X-Ray SDK returned undefined, which is strange.

I need to investigate a bit further to try to understand why that could have happened, but as far as I remember this shouldn't happen unless AWSXRay.getSegment (here) was called outside of the context of a function invocation. However the call stacks in the logs you shared don't seem to indicate that.

I agree that the behavior of throwing an error is not explicitly documented, and that in most cases the current behavior might not be the best/desired one. However, strictly speaking, I'm not sure yet if this should be marked as a bug, given that Tracer behaved as it was supposed to (aka throw an error when no segment is returned).

With that said, I'm open to discuss and explore the possibility to change this into a warning like you suggested, or to at least give users a way to decide the behavior similar to what the X-Ray SDK does with the setContextMissingStrategy method.
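For reference, this is roughly what that looks like at the SDK level today; note that this is the X-Ray SDK's own API (with the equivalent AWS_XRAY_CONTEXT_MISSING environment variable), not something Powertools exposes:

import * as AWSXRay from 'aws-xray-sdk-core';

// Log an error instead of throwing when no segment/subsegment is found in the context
AWSXRay.setContextMissingStrategy('LOG_ERROR');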

In the meanwhile, would you be open to sharing the final log of a few of these faulty executions? I'd like to see, if possible, the REPORT log and whether the tracing-related info was present. If you prefer to do so in a less public forum, you can also email them to aws-lambda-powertools-feedback@amazon.com.

@dreamorosi dreamorosi added tracer This item relates to the Tracer Utility discussing The issue needs to be discussed, elaborated, or refined and removed triage This item has not been triaged by a maintainer, please wait labels Mar 8, 2023
dreamorosi (Contributor) commented

I'd like to amend a section of my previous comment:

I need to investigate a bit further to try to understand why that could have happened, but as far as I remember this shouldn't happen unless AWSXRay.getSegment (here) was called outside of the context of a function invocation. However the call stacks in the logs you shared don't seem to indicate that.

After looking again at the logs you provided, and specifically at the call stack, we can see this line at the top of each logged error:

at Object.contextMissingLogError [as contextMissing] (/var/task/index.js:15505:23)

Based on this entry of the call stack, it seems that the root cause of the issue is the fact that both the X-Ray SDK and Tracer were not able to find any trace data. This usually happens when trying to retrieve a segment or perform actions on it outside of the context of an invocation.

To continue with the investigation, could you please also provide two more pieces of info:

  • Which version of the X-Ray SDK are your Lambda functions running? I'm asking this because in 3.4.1 the SDK changed its behavior around missing context from throwing a runtime error to logging (PR here). Based on the call stack it seems you might be using that version (for context, Powertools still pins the previous one)
  • Any chance you could also share the high level structure of your function, the exported handler (not its content), and how you're using the middleware (plural)? Based on what I discussed above it seems that somehow the middleware is running outside of the Lambda context, although it's very strange and shouldn't happen

More details on how the AWS X-Ray SDK (from now on referred to as the SDK) resolves the trace context/data; note that all of this happens upstream of Powertools:

  1. The first time the SDK is imported, an IIFE defined here is called.
  2. This function configures the SDK to work in Lambda by setting some configs.
  3. After that, a namespace and context are created, and the first facade segment is set.
  4. Up to this point, the facade segment is not valid and is a dummy one, as we are still outside the scope of the actual function handler.

Now, the codepath that leads to the stack in your logs, as far as I know, can be triggered only by first calling one of these three methods of the SDK:

  • getSegment, which corresponds to this implementation
  • setSegment, which corresponds to this implementation
  • resolveSegment, implemented here, which you probably won't be using in userland and which is usually called by one of the other two under the hood

All these methods are related to the segment and context; the reason is that the SDK stores the current segment in this context object (from this package, which acts as a custom implementation/polyfill of the Async Hooks API present in more modern Node.js versions).

In a "regular" execution, the codepath executed should be this, which essentially tells the SDK to create a new facade segment, this time using the info coming from the Trace data present in the environment.

If we look at the logic of these three methods, the error related to the contextMissing can be triggered only if said methods were called outside of the context of a lambda execution, i.e.

import { getSegment } from 'aws-xray-sdk-core';

const segment = getSegment(); // <- This will trigger a context missing

export const handler = () => {};

I know the example above is extremely trivial and simplistic, but is there any chance that this could be happening somewhere in your code, either outside of the handler or in middleware that runs before the handler?


danrivett commented Mar 8, 2023

Thank you so much @dreamorosi for taking such a timely and detailed look into this bug report. It really is greatly appreciated.

It is still early on the west coast here, and we still have some actions remaining from the outage yesterday that I need to attend to first, but I will respond later today with more information that you requested. I will post what I can here, and the rest I will email.

However three brief updates:

  1. We aren't calling X-Ray directly; in fact I had to add the aws-xray-sdk-core package to our package.json file in order to code the workaround solution above and manually new up a Subsegment object.
  2. We do have other middleware installed, namely other Powertools middleware such as injectLambdaContext provided by the Logger, but also custom middleware.
  • As examples, some of these middleware translate thrown errors of a certain type (ClientError in our case) into a returned error object rather than rethrowing the error (this is because the Lambda is invoked by AppSync and we want to be able to return the equivalent of 400s without marking our Lambdas as in error when they work as designed)
  • Other middleware also does global input policing to ensure that certain fields match our rules (because AppSync doesn't allow for custom scalar types, such as these common ones) 🙏
  • Therefore it's possible, and we believe it did happen in the cases detected so far, that another middleware layer threw an error. However, there are many other invocations where this has happened without causing the reported issue, so if that is a factor, it doesn't affect us very often.
  • And I want to be clear that although an error was thrown by another middleware for one request, the Tracer middleware continued to fail for all requests until we manually redeployed it an hour later and cold-started it.
    • The key thing here is that the error becomes permanent until the function is destroyed and a new one is cold-started. All our subsequent invocations that passed the middleware checks failed with the error reported above.

My final point, though, is that regardless of the above, I think there is a design decision that needs to be carefully thought through and made: should an error in any instrumentation middleware, such as this X-Ray one but not limited to it, be severe enough to warrant not executing the wrapped underlying function successfully?

One can argue both sides of this, but as I mentioned above I view instrumentation as best-effort and auxiliary, and it shouldn't itself prevent the underlying function from executing if it encounters an error.

This could of course be configurable, and although I wouldn't recommend it, the default could even be to block the execution of the underlying function if the instrumentation encounters an error, as there are valid reasons for this. As I mentioned, we have middleware which polices the input, so of course we want those errors to block execution of the underlying function. But I don't view the failure of tracing instrumentation as cause for blocking the execution of the underlying function; I would much rather log a warning and alert off of that for us to investigate.
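To illustrate what I mean by configurable, something along these lines would do; this is purely hypothetical, and the failOnMissingSegment option does not exist in Powertools today:

import { Tracer } from '@aws-lambda-powertools/tracer';
import { Segment, Subsegment } from 'aws-xray-sdk-core';

// Hypothetical sketch only: failOnMissingSegment is not a real Powertools option
export class ConfigurableTracer extends Tracer {
  public constructor(private readonly failOnMissingSegment: boolean = false) {
    super();
  }

  public getSegment(): Segment | Subsegment {
    try {
      return super.getSegment();
    } catch (error) {
      if (this.failOnMissingSegment) throw error; // opt back into today's behavior
      console.warn('Unable to resolve the current segment; continuing with a fallback subsegment.', error);
      return new Subsegment('## Fallback segment');
    }
  }
}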

So regardless of the root cause of why this.provider.getSegment() returned undefined in Tracer.getSegment(), I think a decision needs to be made and documented about whether the Powertools instrumentation middleware should try to run on a best-effort basis, and in that case it should not choose to throw errors that will prevent the underlying function from executing, unless absolutely necessary.

To put this in context, there was real customer impact for over an hour yesterday because of the errors being thrown. I'd argue that not tracing those requests and allowing them to proceed in this error scenario is a much better alternative, which is why I am putting so much effort into explaining my reasoning.

We can, and should, always work to fix the root cause of issues (that's part of my investigation today), but there will always be bugs, so I feel we have to decide if we want to design middleware that gives the underlying request the greatest chance of success and therefore tolerates certain error scenarios that shouldn't, but can, happen.

One final addendum: the workaround above seems to be working well for us. Overnight we had 8 function invocations out of 12,000 that couldn't resolve the Segment, but they failed over and didn't cause subsequent executions to fail before the next cold start.

[Screenshot: x-ray-failovers]

Interesting that those invocations were clustered too, which is something I'll look into. (Update: they aren't, see comment below)

Thanks again @dreamorosi, your huge efforts on this library and in helping customers do not go unnoticed and are very much appreciated.

dreamorosi (Contributor) commented

Thank you for the additional details. I just wanted to clarify that my investigation into a root cause is not meant to deflect responsibility onto another package or onto your code, but only to try to identify the actual issue that caused the outage, and to understand whether it is something inherent to Tracer that, if so, should be fixed so that it doesn't affect other customers.

The above is separate from and independent of the fact that, as you say, tracing throwing an error should not break the whole middleware stack. I agree with that, and after looking at the X-Ray SDK, which recently changed its default behavior to just log when this happens, I think we should do the same.

I'm happy to move forward with this fix; however, if you are available, I'd still like to understand what caused the issue. I'm concerned that if we apply the fix and just log, we'll be patching the behavior (aka the error being thrown) but we won't be fixing the actual issue that caused it. However, I also understand that you might have other priorities or not be interested in dedicating time to this, so if that's the case I 100% understand and there's no expectation whatsoever from our side.


danrivett commented Mar 8, 2023

Thanks @dreamorosi, I definitely want to provide additional info to help determine the root cause; we ourselves want to understand it for similar reasons. My gut is that there is an edge condition somewhere, possibly in the X-Ray SDK, but it could easily be outside of it too; it's too early for me to draw many conclusions.

I've been continuing to investigate, and it's still ongoing, but I do have some additional information I've uncovered that I can share here below.

But first I want to clarify an incorrect observation I made above (that's the danger of reporting timely but prematurely): those 8 function invocations were not clustered, and in fact there were 4 function invocations, and each error had two log entries that matched my query (one from the log message from aws-xray-sdk-core itself, and the other from our failover log warning).

I thought it was worth updating on that as those 4 cases were isolated and not clustered as I first thought.


danrivett commented Mar 8, 2023

The additional info I've found so far centres on the 4 failed-over requests overnight. I think it should be helpful to your investigation too.

All 4 of these showed the same pattern:

  1. We received a Lambda request which one custom middleware function threw an error policing input - in all 4 cases it was due to an empty id field being passed in, since AppSync currently can only police that a field is required, but not that it's non-empty (see the custom scalar type discussion above).
  2. A second middleware function intercepted the thrown error, as it was of type ClientError (which, as I mentioned above, is a superclass we use to allow us to return client error objects), and this middleware function returned an error object response, swallowing the error.

It's worth pointing out that this request finished successfully (with a client error object returned).

Here's what we logged for that first request. This is logged by the 2nd middleware layer which caught the client error and returned the error object:

{
    "cold_start": true,
    "function_arn": "xxx",
    "function_memory_size": 512,
    "function_name": "xxx",
    "function_request_id": "xxx",
    "level": "INFO",
    "message": "ClientError (InputError) caught ('Input.BusinessIdMissing'); returning error object to requestor.",
    "service": "xxx",
    "timestamp": "2023-03-08T01:20:35.328Z",
    "xray_trace_id": "xxx",
    "error": {
        "returnObject": {
            "message": "No business id received",
            "error": {
                "type": "Input.BusinessIdMissing",
                "code": "BusinessIdMissing",
                "category": "Input"
            },
            "errorInfo": {
                "fieldName": "businessId",
                "fieldValue": ""
            }
        },
        "class": "InputError",
        "stack": [
            "Error: No business id received",
            "at new InputError (/var/task/index.js:23155:5)",
            "at validateBusinessId (/var/task/index.js:23223:11)",
            "at /var/task/index.js:26810:23",
            "at runMiddlewares (/var/task/index.js:26698:23)",
            "at async runRequest (/var/task/index.js:26656:5)"
        ]
    }
}
  3. We next receive a subsequent Lambda request which has valid input, and so no error is thrown by the middleware layer in step 1 above.
  4. However, the Tracer middleware (captureLambdaHandler) cannot identify the segment and throws an error, which is then caught by our wrapper (see CustomTracer above), and it fails over to a manually created SubSegment.
  5. This request then completes successfully.

Subsequent requests complete successfully too, with no additional failover logged. I'm assuming they're all using the failed-over SubSegment manually created in step 4, but I can't prove that. That's my assumption because, before we deployed our wrapper, every subsequent request failed due to the error thrown by Tracer.getSegment().

Important Detail

In all 4 cases, the request in step 1 above which failed due to a missing id happened on a cold start. I think this is important, as an error was thrown by one middleware layer before being caught by another (which I believe is permissible even if unusual), but this thrown error seems to have disrupted AWS X-Ray's ability to obtain the segment correctly.

The fact, though, that the first request on a cold start was an invalid one seems to be a key reason why we didn't see Tracer.getSegment() failing after other invalid requests that happened on a warm start. My assumption here is that the Segment has already been initialized in memory, so an invalid request doesn't prevent X-Ray from retrieving the Segment.

But it's also odd that on the invalid request itself, the captureLambdaHandler doesn't seem to have a problem.

The order in which the middleware functions are added also appears to be contributing, because we're currently adding the tracer after the custom middleware, since the tracer is optional for us (though it currently defaults to enabled):

mHandler = mHandler
    .use(injectLambdaContext(logger, { clearState: true }))
    .use(verifyRequest(logger)) // <- Error thrown here
    .use(returnClientErrors(logger)); // <- Error caught here

if (tracer) {
  mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}

Looking at this now, my assumption is that on that first request captureLambdaHandler() never has its captureLambdaHandlerBefore function called, since a previous before function threw an error. Even though the error is caught by returnClientErrors, it's caught in its onError function, and so I assume that causes the rest of the before middleware functions to not be executed?

I'm not sure if captureLambdaHandlerAfter or captureLambdaHandlerError is still called in that case? My assumption is captureLambdaHandlerAfter is, since returnClientErrors swallows the error thrown and returns an error object.

Either way, both captureLambdaHandlerAfter and captureLambdaHandlerError call close(), and possibly there's a problem with calling close() without open() having first been called, because captureLambdaHandlerBefore was skipped, but I'm really reaching beyond my understanding at this point.

That explanation doesn't quite make sense to me, because close() also calls Tracer.getSegment(), so I'd assume that first invalid cold-start request would then fail when it was called, and it doesn't: that request succeeds in returning an error object back to the requestor!

Sorry for so much detail, but I'm not sure what is helpful or not.

I am going to investigate and test moving the captureLambdaHandler earlier to see if that avoids failing over, but I also think it would be beneficial to stop throwing an error in Tracer.getSegment() and instead log a warning and fail over.

I am also curious whether there's a weakness in aws-xray-sdk-core or the Powertools Tracer, as I'm not sure why, prior to us wrapping Tracer, it was never able to retrieve the Segment in subsequent requests until the function was cold-started again.

Given the above, I think it may be possible to recreate this by creating your own middleware that mimics verifyRequest() and returnClientErrors(), to see if you can recreate the part where subsequent requests are never able to get the Segment.

Let me know if there's more info you'd like. Thanks @dreamorosi.

dreamorosi (Contributor) commented

Hi Dan, this is great, and I appreciate you taking the time to share all this info.

I think I have enough to continue the investigation. I'd like to focus my attention on the behavior you have described here:

I'm not sure if captureLambdaHandlerAfter or captureLambdaHandlerError is still called in that case? My assumption is captureLambdaHandlerAfter is, since returnClientErrors swallows the error thrown and returns an error object.

Either way, both captureLambdaHandlerAfter and captureLambdaHandlerError call close(), and possibly there's a problem in calling close() without first calling open() by captureLambdaHandlerBefore being skipped, but I'm really reaching beyond my understanding at this point.

I think you are onto something here; however, I need to simulate this middleware stack to confirm our understanding.

I'll continue my investigation tomorrow and report back here.


One note, not necessarily related to this issue/outage, but that I still want to share based on the code snippet you shared above:

mHandler = mHandler
    .use(injectLambdaContext(logger, { clearState: true }))
    .use(verifyRequest(logger)) // <- Error thrown here
    .use(returnClientErrors(logger)); // <- Error caught here

if (tracer) {
  mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}

If possible, I'd recommend refactoring this to something along these lines:

if (tracer) {
  mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}

mHandler = mHandler
    .use(injectLambdaContext(logger, { clearState: true }))
    .use(verifyRequest(logger)) // <- Error thrown here
    .use(returnClientErrors(logger)); // <- Error caught here

The reason why I suggest this is due to how Middy handles the middleware stack (described here). Given your original middleware stack above, the execution order would be as follows:

injectLambdaContext
verifyRequest
returnClientErrors
captureLambdaHandler
handler
captureLambdaHandler
returnClientErrors
verifyRequest
injectLambdaContext

Unless this was an explicit choice, this means that the captureLambdaHandler is tracing only the content of the actual handler and any time spent in the other middleware is excluded from the segments, but still part of the main Invocation segment generated by Lambda (referred to as facade in a previous message).

Regardless of the handler throwing or not, ideally you'd want the captureLambdaHandler to be the first and last middleware so that the whole execution is part of the segment. Additionally, if there's an error, with Tracer it would be added to the segment as metadata.


danrivett commented Mar 8, 2023

Thanks so much @dreamorosi for continuing to investigate and your suggestion on moving the captureLambdaHandler wiring.

It wasn't a deliberate choice, it just seemed the most intuitive to write at the time, but I agree that considering the order of execution, it makes more sense to have it first.

It may also work around whatever weakness there appears to be somewhere that causes the Segment to not be resolvable in the scenario above. Either way it will be a good data point so I'll report back after the change with anything I find.


danrivett commented Mar 8, 2023

Update on this: as expected, I managed to recreate the issue in my dev environment after reverting the failover logic:

  1. On a cold start make a request with an invalid (empty) id
  • Client Error is returned as desired
  2. Make a valid request
  • Receive error back with Failed to get the current sub/segment from the context
  3. Continue to make valid requests, and all fail with the same error.

Now, when I move captureLambdaHandler earlier as suggested, I'm unable to recreate the issue; instead:

  1. On a cold start make a request with an invalid (empty) id
  • Client Error is returned as desired
  2. Make a valid request
  • Receive a valid response
  3. Continue to make valid requests, and all continue to succeed.

So that was a great suggestion @dreamorosi in more ways than one! Thank you.

Possible Follow-up Actions

I still think there is a weakness worth investigating as time allows: why the segment can't be resolved in this case, since it is possible that captureLambdaHandler may not be the first middleware. Though based on your observations, that may be rare enough to acknowledge and take no further action.

In addition I'd still suggest considering whether throwing errors in this middleware, unless absolutely necessary, is something that should be done longer term.

I am hopeful removing the error in Tracer.getSegment() and failing over may be a good starting point (even if it's less likely to occur based on what we've found out) mainly because of the oversized impact that it had, causing every subsequent request to also fail.

Regardless, feel free to close this ticket off once you've completed any intended actions.

dreamorosi (Contributor) commented

Hi, thank you for reporting back.

I have been investigating this further and I have also got in touch internally with the X-Ray service team to better understand in which cases the SDK could return undefined when retrieving the main segment.

The team has confirmed that this should not happen in normal circumstances while running in Lambda. The getSegment-related methods in the SDK are typed to return undefined because this same SDK supports other execution environments like ECS and EC2, and in these circumstances we cannot guarantee that the Trace ID is well formed, hence the undefined.
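In other words, callers using the SDK directly in those environments are expected to handle the undefined case themselves; a trivial sketch of what that looks like (unrelated to Powertools internals):

import * as AWSXRay from 'aws-xray-sdk-core';

const segment = AWSXRay.getSegment(); // typed to possibly return undefined

if (segment === undefined) {
  // e.g. running on ECS/EC2 without a well-formed trace context: degrade gracefully
}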

In your specific case, without reviewing the code and telemetry, I'm unable to understand exactly why that happened.

Now, regarding the middleware behavior, I was able to confirm that the following code, i.e. two custom middleware in which the first one in the stack throws:

import middy from '@middy/core';

const customMiddleware = () => {
  const customMiddlewareBefore = async () => {
    console.log('firstMiddlewareBefore');
    throw new Error("Error in customMiddlewareBefore");

  }

  const customMiddlewareAfter = async () => {
    console.log('firstMiddlewareAfter');
  }

  const customMiddlewareOnError = async () => {
    console.log('firstMiddlewareOnError');
  }

  return {
    before: customMiddlewareBefore,
    after: customMiddlewareAfter,
    onError: customMiddlewareOnError
  }
}

const secondMiddleware = () => {
  const customMiddlewareBefore = async () => {
    console.log('secondMiddlewareBefore');
  }

  const customMiddlewareAfter = async () => {
    console.log('secondMiddlewareAfter');
  }

  const customMiddlewareOnError = async () => {
    console.log('secondMiddlewareOnError');
  }

  return {
    before: customMiddlewareBefore,
    after: customMiddlewareAfter,
    onError: customMiddlewareOnError
  }
}

export const handler = middy(async () => {
  return {
    statusCode: 200,
    body: JSON.stringify({
      message: 'hello world',
    }),
  };
})
  .use(customMiddleware())
  .use(secondMiddleware());

results in this call order:

START RequestId: 5a18ccbf-7015-4133-adb6-981d5f41a307 Version: $LATEST
2023-03-09T18:15:00.505Z	5a18ccbf-7015-4133-adb6-981d5f41a307	INFO	firstMiddlewareBefore
2023-03-09T18:15:00.511Z	5a18ccbf-7015-4133-adb6-981d5f41a307	INFO	secondMiddlewareOnError
2023-03-09T18:15:00.511Z	5a18ccbf-7015-4133-adb6-981d5f41a307	INFO	firstMiddlewareOnError
2023-03-09T18:15:00.512Z	5a18ccbf-7015-4133-adb6-981d5f41a307	ERROR	Invoke Error 	{"errorType":"Error","errorMessage":"Error in customMiddlewareBefore","stack":["Error: Error in customMiddlewareBefore","    at customMiddlewareBefore (/var/task/index.js:7012:11)","    at runMiddlewares (/var/task/index.js:6925:23)","    at runRequest (/var/task/index.js:6882:11)","    at Runtime.middy2 [as handler] (/var/task/index.js:6833:12)","    at Runtime.handleOnceNonStreaming (file:///var/runtime/index.mjs:1085:29)"]}
END RequestId: 5a18ccbf-7015-4133-adb6-981d5f41a307
REPORT RequestId: 5a18ccbf-7015-4133-adb6-981d5f41a307	Duration: 48.62 ms	Billed Duration: 49 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	Init Duration: 248.47 ms	
XRAY TraceId: 1-640a2224-466141706af3a2044f3553d1	SegmentId: 16354c752fc51107	Sampled: true	

Which I think confirms that if one of the middleware throws in its before hook, the remaining before hooks are skipped and the onError hooks run instead.

Before closing this issue I will open a PR to align the Tracer to the current behavior of the SDK and default to log a warning instead of throwing.

Thank you for the help while troubleshooting and investigating this, I appreciate the time you spent to provide info and contribute to improve the library!

@dreamorosi dreamorosi added confirmed The scope is clear, ready for implementation and removed discussing The issue needs to be discussed, elaborated, or refined labels Mar 13, 2023
@dreamorosi dreamorosi changed the title Bug: Tracer Error retrieving Segment led to outage Feature request: Tracer should log warning instead of throwing when segment is not found Mar 13, 2023
@dreamorosi dreamorosi added feature-request This item refers to a feature request for an existing or new utility and removed confirmed The scope is clear, ready for implementation labels Mar 13, 2023
dreamorosi (Contributor) commented

Hi Dan, last update on this topic: I have updated the categorization and title of the issue to better align with my understanding of your latest message.

In parallel, I have also worked on a PR (#1370) that should cover most of the points we discussed in this issue.

The issue will be closed once the PR is merged.

@dreamorosi dreamorosi moved this from Working on it to Pending review in AWS Lambda Powertools for TypeScript Mar 13, 2023
danrivett (Author) commented

Thanks @dreamorosi. I'm so appreciative of how responsive you've been in investigating and improving an already great library. It helps us make the case to standardize on this library to improve our observability, so I am definitely happy about that.

Regarding your investigation, I wonder if you are able to use what you did to recreate the bug I found by also wiring in captureLambdaHandler as a 3rd middleware.

My understanding would be that if you made two or more requests, with the customMiddleware function only throwing an error on the first request, then the 2nd request onwards should throw an error from the Tracer.getSegment() call in captureLambdaHandlerBefore.

If that is the case, I'm hopeful that it may be something that could be passed on to the AWS X-Ray team to easily recreate and investigate on their end, as it may be something very simple, such as some in-memory state that isn't correct because captureLambdaHandlerBefore wasn't called the first time round.

Obviously for us we've moved the captureLambdaHandler ahead of those middleware functions, so it's not high importance to do this, but I thought it may be worth considering if we now have a simple problem recreation.

@github-project-automation github-project-automation bot moved this from Pending review to Coming soon in AWS Lambda Powertools for TypeScript Mar 20, 2023
github-actions bot commented

⚠️ COMMENT VISIBILITY WARNING ⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added the pending-release This item has been merged and will be released soon label Mar 20, 2023
@dreamorosi dreamorosi moved this from Coming soon to Shipped in AWS Lambda Powertools for TypeScript Mar 20, 2023
@dreamorosi dreamorosi added completed This item is complete and has been merged/shipped and removed pending-release This item has been merged and will be released soon bug Something isn't working labels Mar 20, 2023