Feature request: Tracer should log warning instead of throwing when segment is not found #1356
Comments
Hi @danrivett, thank you for reporting this. I'll review the report tomorrow morning (EMEA time zone) and write my findings here. If needed, after finding and applying a fix I'm open to doing an emergency release to avoid further outages.
Thanks for such a quick response @dreamorosi, especially given your timezone. For now we've worked around it by subclassing the Tracer:

export class CustomTracer extends Tracer {
public constructor(options?: TracerOptions) {
super(options);
this.getSegment = this.getSegment.bind(this);
this.createFallbackSubSegment = this.createFallbackSubSegment.bind(this);
}
getSegment(): Segment | Subsegment {
try {
return super.getSegment();
} catch (error) {
console.error(
'Failed to get the current sub/segment from the context, so failing over to manually create one to allow the request to proceed.',
error
);
return this.createFallbackSubSegment();
}
}
protected createFallbackSubSegment(): Subsegment {
return new Subsegment(
'## Fallback segment (as unable to retrieve existing Segment)'
);
}
}

Note: This external fix is slightly different from the one suggested above, because I'm wrapping the whole getSegment() call in a try/catch rather than only handling an undefined return value.
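For illustration, a minimal sketch of how a custom tracer like this might be wired into the middleware stack; the handler body and service name below are hypothetical, and the import paths assume the Powertools v1 packages used in this issue:

import middy from '@middy/core';
import { captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import type { Context } from 'aws-lambda';
import { CustomTracer } from './custom-tracer'; // hypothetical module containing the subclass above

// The fallback-aware tracer is passed to the middleware exactly like the stock Tracer,
// so a failed getSegment() call only logs and falls back instead of failing the invocation.
const tracer = new CustomTracer({ serviceName: 'example-service' }); // service name is illustrative

export const handler = middy(async (_event: unknown, _context: Context) => ({ ok: true }))
  .use(captureLambdaHandler(tracer));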
So, I've started looking into this, and on the Powertools side I was able to reproduce the behavior with this unit test case:

test('when no subsegment is returned, everything blows up', async () => {
// Prepare
const tracer: Tracer = new Tracer();
jest.spyOn(tracer.provider, 'getSegment')
.mockImplementationOnce(() => undefined);
const lambdaHandler: Handler = async (_event: unknown, _context: Context) => ({
foo: 'bar'
});
const handler = middy(lambdaHandler).use(captureLambdaHandler(tracer));
// Act & Assess
await expect(handler({}, context)).rejects.toThrowError(
Error('Failed to get the current sub/segment from the context.')
);
});

Now, by looking at the logs in the OP, and at the call stack in them, it seems that the first error was already thrown by the getSegment() call in captureLambdaHandlerBefore. This points to the fact that for some reason the X-Ray SDK returned undefined instead of a segment. I need to investigate a bit further to try to understand why that could have happened, but as far as I remember this shouldn't happen unless the segment is requested outside of the context of an invocation.

I agree that the behavior of throwing an error is not explicitly documented, and that in most cases the current behavior might not be the best/desired one. However, strictly speaking, I'm not sure yet if this should be marked as a bug, given that Tracer behaved as it was supposed to (aka throw an error when no segment is returned). With that said, I'm open to discussing and exploring the possibility of changing this into a warning like you suggested, or at least giving users a way to decide the behavior, similar to what the X-Ray SDK does with its context missing strategy.

In the meanwhile, would you be open to sharing the final log of a few of these faulty executions? I'm looking to see, if possible, the report log and whether the tracing-related info was present. If you prefer to do so in a less public forum, you can also email them to aws-lambda-powertools-feedback@amazon.com.
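For reference, the SDK-level knob alluded to above looks roughly like this; this is aws-xray-sdk-core rather than Powertools, and the strategy values are the ones documented by the X-Ray SDK:

import * as AWSXRay from 'aws-xray-sdk-core';

// The SDK lets callers choose what happens when no trace context is found:
// 'RUNTIME_ERROR' throws, 'LOG_ERROR' logs and continues, 'IGNORE_ERROR' stays silent.
// The same choice can be made via the AWS_XRAY_CONTEXT_MISSING environment variable.
AWSXRay.setContextMissingStrategy('LOG_ERROR');

// A custom handler is also accepted, e.g. to emit a structured warning instead:
AWSXRay.setContextMissingStrategy((message: string) => {
  console.warn('X-Ray context missing:', message);
});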
Would like to amend a section of my previous comment:
After looking again at the logs you provided, and specifically at the call stack, we can see this line at the top of each error log:
Based on this entry of the call stack, it seems that the root cause of the issue is the fact that both the X-Ray SDK and Tracer were not able to find any trace data. This usually happens when trying to retrieve a segment or perform actions on it outside of the context of an invocation. To continue with the investigation, could you please also provide two more pieces of info:
Some more details on how the AWS X-Ray SDK (from now on referred to as "the SDK") resolves the trace context/data; note that all of this happens upstream of Powertools.
Now, the codepath that leads to the stack in your logs, as far as I know, can be triggered only by first calling one of these three methods of the SDK:
All these methods are related to segments and context, and the reason is that the SDK stores the current segment in this context object (from this package, which acts as a custom implementation/polyfill of the Async Hooks feature present in more modern Node.js versions). In a "regular" execution, the codepath executed should be this, which essentially tells the SDK to create a new facade segment based on the trace data that the Lambda service exposes to the function.

If we look at the logic of these three methods, the error related to the missing context can also be triggered simply by requesting a segment outside of an invocation, for example:

import { getSegment } from 'aws-xray-sdk-core';

const segment = getSegment(); // <- This will trigger a context missing error

export const handler = () => {};

I know the example above is extremely trivial and simplistic, but is there any chance that this could be happening somewhere in your code, either outside of the handler or in middleware that runs before the Tracer's?
Thank you so much @dreamorosi for taking such a timely and detailed look into this bug report. It really is greatly appreciated. It is still early on the west coast here, and we still have some actions remaining from yesterday's outage that I need to attend to first, but I will respond later today with the additional information you requested. I will post what I can here, and email the rest. However, three brief updates:
My final point, though, is that regardless of the above, I think there is a design decision that needs to be carefully thought through and made: should an error in any instrumentation middleware - such as this X-Ray one, but not limited to it - be severe enough to warrant not executing the wrapped underlying function successfully? One can argue both sides of this, but as I mentioned above I view instrumentation as best-effort and auxiliary, and it shouldn't itself prevent the underlying function from executing if it encounters an error. This could of course be configurable, and although I wouldn't recommend it, the default could be to block the execution of the underlying function on error, as there are valid reasons for this. As I mentioned, we have middleware which polices the input, so of course we want those errors to block execution of the underlying function. But I don't view the failure of tracing instrumentation as cause for blocking the execution of the underlying function; I would much rather log a warning and alert off of that for us to investigate. So regardless of the root cause of why the segment couldn't be retrieved, I would still advocate for logging rather than throwing in this scenario.

To put this in context, there was real customer impact for over an hour yesterday because of the errors being thrown. I'd argue that not tracing those requests and allowing them to proceed in this error scenario is a much better alternative, which is why I am putting so much effort into explaining my reasoning. We can and always should work to fix the root cause of issues (that's part of my investigation today), but there will always be bugs, and so I feel we have to decide if we want to design middleware that gives the underlying request the greatest chance of success, and so tolerates certain error scenarios that shouldn't, but can, happen.

One final addendum: the workaround above seems to be working well for us. Overnight we had 8 function invocations out of 12,000 which encountered not being able to resolve the Segment, but they failed over and didn't cause subsequent executions to fail up until the next cold start.
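To make the best-effort idea concrete, here is a hypothetical sketch (not part of the original discussion) of a generic Middy wrapper that downgrades failures in an instrumentation middleware's hooks to warnings; the failOpen name and behavior are assumptions, not an existing API:

import middy from '@middy/core';

// Wrap any middleware so that errors thrown by its before/after/onError hooks are logged
// instead of propagated, keeping instrumentation from failing the actual request.
const failOpen = (middleware: middy.MiddlewareObj): middy.MiddlewareObj => {
  const guard =
    (fn?: middy.MiddlewareFn) =>
    async (request: middy.Request): Promise<void> => {
      if (!fn) return;
      try {
        await fn(request);
      } catch (error) {
        console.warn('Instrumentation middleware failed; continuing with the request.', error);
      }
    };

  return {
    before: guard(middleware.before),
    after: guard(middleware.after),
    onError: guard(middleware.onError),
  };
};

// Usage (hypothetical): handler.use(failOpen(captureLambdaHandler(tracer)));

This would only be appropriate for observability middleware; validation middleware should of course still be allowed to fail the request.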
Thanks again @dreamorosi, your huge effort on this library and in helping customers does not go unnoticed and is very much appreciated.
Thank you for the additional details. I just wanted to clarify that my investigation into the root cause is not meant to defer responsibility to another package or to your code, but only to try to identify the actual issue that caused the outage, and to understand whether it is something inherent to Tracer that should be fixed so it doesn't affect other customers.

The above is separate and independent from the fact that, as you say, tracing throwing an error should not break the whole middleware stack. I agree with that, and after looking at the X-Ray SDK, which recently changed its default behavior to just log when this happens, I think we should do the same.

I'm happy to move forward with this fix; however, if you are available, I'd still like to understand what caused the issue. I'm concerned that if we apply the fix and just log, we'll be patching the behavior (aka the error being thrown) but not fixing the actual issue that caused it. However, I also understand that you might have other priorities or not be interested in dedicating time to this, so if that's the case I 100% understand and there's no expectation whatsoever from our side on you doing that.
Thanks @dreamorosi, I definitely want to provide additional info to help determine the root cause - we ourselves want to understand it for similar reasons. My gut feeling is that there is an edge condition somewhere, possibly in the X-Ray SDK, but it could easily be outside of that too; it's too early for me to draw many conclusions.

I've been continuing to investigate, and it's still ongoing, but I do have some additional information I've uncovered that I can share below. First, though, I want to clarify an incorrect observation I made above (that's the danger of reporting timely but prematurely): those 8 function invocations were not clustered; they were in fact 4 function invocations, and each error produced two log entries that matched my query (one of them being the failover log message from our CustomTracer).

I thought it was worth updating on that, as those 4 cases were isolated and not clustered as I first thought.
The additional info I've found so far centres on the 4 failed-over requests overnight; I think it should be helpful to your investigation too. All 4 of these showed the same pattern:
It's worth pointing out that this request finished successfully (with a client error object returned). Here's what we logged for that first request; this was logged by the 2nd middleware layer, which caught the client error and returned the error object:

{
"cold_start": true,
"function_arn": "xxx",
"function_memory_size": 512,
"function_name": "xxx",
"function_request_id": "xxx",
"level": "INFO",
"message": "ClientError (InputError) caught ('Input.BusinessIdMissing'); returning error object to requestor.",
"service": "xxx",
"timestamp": "2023-03-08T01:20:35.328Z",
"xray_trace_id": "xxx",
"error": {
"returnObject": {
"message": "No business id received",
"error": {
"type": "Input.BusinessIdMissing",
"code": "BusinessIdMissing",
"category": "Input"
},
"errorInfo": {
"fieldName": "businessId",
"fieldValue": ""
}
},
"class": "InputError",
"stack": [
"Error: No business id received",
"at new InputError (/var/task/index.js:23155:5)",
"at validateBusinessId (/var/task/index.js:23223:11)",
"at /var/task/index.js:26810:23",
"at runMiddlewares (/var/task/index.js:26698:23)",
"at async runRequest (/var/task/index.js:26656:5)"
]
}
}
Subsequent requests completed successfully too, with no additional failover logged, but I'm assuming they were all using the failed-over fallback subsegment.

Important detail: in all 4 cases, the request in 1. above which failed due to a missing id happened on a cold start. I think this is important, as an error was thrown by one middleware layer before being caught by another (which I believe is permissible, even if unusual), but that thrown error seems to have disrupted AWS X-Ray from being able to obtain the segment correctly. The fact that the first request on a cold start was an invalid one seems to be a key reason why we don't see this happen more often. But it's also odd that on the invalid request itself, the segment already couldn't be resolved.

The order in which the middleware functions are added also appears to be contributing, because we're currently adding the tracer after the custom middleware, since the tracer is optional for us (though it currently defaults to enabled):

mHandler = mHandler
.use(injectLambdaContext(logger, { clearState: true }))
.use(verifyRequest(logger)) // <- Error thrown here
.use(returnClientErrors(logger)); // <- Error caught here
}
if (tracer) {
mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}

Looking at this now, my assumption is that on that first (invalid) request the captureLambdaHandlerBefore function never ran, because the verifyRequest middleware threw first. I'm not sure if captureLambdaHandlerAfter or captureLambdaHandlerError is still called in that case? My assumption is that at least one of them is, since the failover was logged on that request. Either way, that explanation still doesn't fully make sense to me, because the X-Ray SDK should have had a segment available for that invocation regardless of whether the Tracer's before handler ran.

Sorry for so much detail, but I'm not sure what is helpful or not. I am going to investigate and test moving the captureLambdaHandler middleware so that it is added first. I am also curious if there's a weakness in how the segment is resolved when an earlier middleware throws.

Given the above, I think it may be possible to recreate this by creating your own middleware that mimics ours (one that throws in its before phase and another that catches that error), with captureLambdaHandler added last. Let me know if there's more info you'd like. Thanks @dreamorosi.
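A hypothetical sketch of such a reproduction, with stand-in middleware names (not the actual project code) and assuming the Powertools v1 exports:

import middy from '@middy/core';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';

const tracer = new Tracer({ serviceName: 'repro' }); // illustrative service name

// Mimics verifyRequest: throws in the "before" phase, before the Tracer's own
// "before" handler (attached last) ever gets a chance to run.
const throwingMiddleware = (): middy.MiddlewareObj => ({
  before: async () => {
    throw new Error('Simulated validation failure');
  },
});

// Mimics returnClientErrors: marks the error as handled by setting a response.
const catchAndReturn = (): middy.MiddlewareObj => ({
  onError: async (request) => {
    request.response = { statusCode: 400, body: 'client error' };
  },
});

export const handler = middy(async () => ({ statusCode: 200 }))
  .use(throwingMiddleware())
  .use(catchAndReturn())
  .use(captureLambdaHandler(tracer)); // attached last, mirroring the original stack

// Invoking this twice in a Lambda sandbox (an "invalid" request on a cold start,
// then a "valid" one) could show whether the second invocation still resolves its segment.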
Hi Dan, this is great, and I appreciate you taking the time to share all this info. I think I have enough to continue the investigation. I'd like to focus my attention on the behavior you have described here:
I think you are onto something here, however I need to simulate this middleware stack to confirm our understanding. I'll continue my investigation tomorrow and report back here.

One note, not necessarily related to this issue/outage, but that I still want to share based on the code snippet you shared above:

mHandler = mHandler
.use(injectLambdaContext(logger, { clearState: true }))
.use(verifyRequest(logger)) // <- Error thrown here
.use(returnClientErrors(logger)); // <- Error caught here
}
if (tracer) {
mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}

If possible, I'd recommend refactoring this to something along these lines:

if (tracer) {
mHandler = mHandler.use(captureLambdaHandler(tracer)); // <- Tracer middleware added here
}
mHandler = mHandler
.use(injectLambdaContext(logger, { clearState: true }))
.use(verifyRequest(logger)) // <- Error thrown here
.use(returnClientErrors(logger)); // <- Error caught here

The reason I suggest this is due to how Middy handles the middleware stack (described here). Given your original middleware stack above, the execution order would be roughly:

1. injectLambdaContext - before
2. verifyRequest - before
3. returnClientErrors - before
4. captureLambdaHandler - before
5. handler
6. captureLambdaHandler - after / onError
7. returnClientErrors - after / onError
8. verifyRequest - after / onError
9. injectLambdaContext - after / onError

Unless this was an explicit choice, this means that the Tracer's before handler runs only after all your custom middleware, and never runs at all when verifyRequest throws. Regardless of the handler throwing or not, ideally you'd want the captureLambdaHandler middleware to be the first one in the stack, so that its before phase runs before everything else and its after/onError phase runs last.
Thanks so much @dreamorosi for continuing to investigate, and for your suggestion on moving the captureLambdaHandler middleware to the front of the stack. It wasn't a deliberate choice, it just seemed the most intuitive to write at the time, but I agree that, considering the order of execution, it makes more sense to have it first. It may also work around whatever weakness there appears to be that causes the segment to become unresolvable in the first place.
Update on this, I managed to recreate the issue in my dev environment after reverting the failover logic as expected:
Now, when I move the captureLambdaHandler middleware so that it is added first, the issue no longer occurs.
So that was a great suggestion @dreamorosi, in more ways than one! Thank you.

Possible follow-up actions: I still think there is a weakness to investigate as time allows, in why the segment can't be resolved in this case, as it is possible that others will hit the same combination of middleware ordering and an early error. In addition, I'd still suggest considering whether throwing errors in this middleware, unless absolutely necessary, is something that should be done longer term. I am hopeful that removing the error in this scenario would prevent similar outages for others. Regardless, feel free to close this ticket off once you've completed any intended actions.
Hi, thank you for reporting back. I have been investigating this further and I have also got in touch internally with the X-Ray service team to better understand in which cases the SDK could return undefined when asked for the current segment. The team has confirmed that this should not happen in normal circumstances while running in Lambda. In your specific case, without reviewing the code and telemetry, I'm unable to understand exactly why it happened.

Now, regarding the middleware behavior, I was able to confirm that the following code, i.e. two custom middleware in which the first one in the stack throws:

import middy from '@middy/core';
const customMiddleware = () => {
const customMiddlewareBefore = async () => {
console.log('firstMiddlewareBefore');
throw new Error("Error in customMiddlewareBefore");
}
const customMiddlewareAfter = async () => {
console.log('firstMiddlewareAfter');
}
const customMiddlewareOnError = async () => {
console.log('firstMiddlewareOnError');
}
return {
before: customMiddlewareBefore,
after: customMiddlewareAfter,
onError: customMiddlewareOnError
}
}
const secondMiddleware = () => {
const customMiddlewareBefore = async () => {
console.log('secondMiddlewareBefore');
}
const customMiddlewareAfter = async () => {
console.log('secondMiddlewareAfter');
}
const customMiddlewareOnError = async () => {
console.log('secondMiddlewareOnError');
}
return {
before: customMiddlewareBefore,
after: customMiddlewareAfter,
onError: customMiddlewareOnError
}
}
export const handler = middy(async () => {
return {
statusCode: 200,
body: JSON.stringify({
message: 'hello world',
}),
};
})
.use(customMiddleware())
.use(secondMiddleware());

results in this call order:

1. firstMiddlewareBefore
2. secondMiddlewareOnError
3. firstMiddlewareOnError

This, I think, confirms that in case one of the middleware throws, all the subsequent ones will still run their onError handler.

Before closing this issue I will open a PR to align the Tracer with the current behavior of the SDK and default to logging a warning instead of throwing. Thank you for the help while troubleshooting and investigating this, I appreciate the time you spent providing info and contributing to improving the library!
Hi Dan, last update on this topic: I have updated the categorization and title of the issue to better align with my understanding of your latest message. In parallel, I have also worked on a PR (#1370) that should cover most of the points we discussed in this issue. The issue will be closed once the PR is merged.
Thanks @dreamorosi. I'm very appreciative of how responsive you've been in investigating and improving an already great library; it helps us make the case to standardize on this library to improve our observability, so I am definitely happy about that.

Regarding your investigation, I wonder if you are able to use what you did to recreate the bug I found by also wiring in the captureLambdaHandler middleware after your two custom middleware. My understanding would be that if you made two or more requests, with the first one hitting the error thrown by the first middleware, you would see the segment fail to resolve. If that is the case, I'm hopeful it may be something that could be passed on to the AWS X-Ray team to easily recreate and investigate on their end, as it may be something very simple, such as some in-memory state which isn't correct since the Tracer's before handler never ran on that first request.

Obviously for us, we've moved the captureLambdaHandler middleware to the front of the stack, so this no longer affects us directly.
Expected Behaviour
Today, for an unknown reason in one of our Lambdas in Production, the Tracer object was unable to retrieve the current Segment from X-Ray. Because the Segment was attempted to be retrieved in the middy middleware at the beginning of every request (here), it caused every Lambda request to fail until we forced a cold-start of the Lambda, which resolved our problem. It caused the request to fail because the Tracer throws an error if the Segment returned is undefined.

Unfortunately our Lambda was kept warm for over an hour before we tracked it down and forced a cold-start. In the meantime every request failed (see execution logs below).

My expected behaviour would be that if a Segment could not be retrieved, it doesn't cause the whole Lambda invocation to fail, which it currently does when using the middy middleware. Tracing should be done on a best-effort basis: if there is any problem tracing a request, it should not cause the main request itself to fail. This particular scenario of getting the Segment is one specific case that could be improved to avoid failing the request, but it's quite possible there may be other code-paths that are vulnerable to causing the main request to fail, particularly if instrumented by the captureLambdaHandler middy middleware.

Current Behaviour
As mentioned above, the Tracer currently throws an error if the Segment returned is undefined. This causes the middy middleware to fail every request, since getSegment() is called in the captureLambdaHandlerBefore function (here).

Code snippet
You could wire in a dummy Tracing provider to return undefined from its getSegment() method, called here.
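For example, inside a Jest test this could look like the reproduction shared earlier in this thread (a sketch, assuming the tracer's provider property is accessible as in that test):

import { Tracer } from '@aws-lambda-powertools/tracer';

// Force the underlying X-Ray provider to behave as it did during the outage:
// getSegment() returns undefined instead of the current segment.
const tracer = new Tracer();
jest.spyOn(tracer.provider, 'getSegment').mockImplementation(() => undefined);

// Any code path that now calls tracer.getSegment() - such as the captureLambdaHandler
// middleware's "before" hook - will throw
// 'Failed to get the current sub/segment from the context.'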
Steps to Reproduce

We don't know yet what caused aws-xray-sdk-core's getSegment() (here) to return undefined, but if you can simulate that and use the captureLambdaHandler middy middleware, you will find that it causes the wrapped request to fail. For us it caused requests to fail until the function was cold-started, since getSegment() never went back to returning a Segment object.

Possible Solution
My suggestion would be to consider updating the Tracer.getSegment() method to log a warning and return a dummy SubSegment if a real segment couldn't be retrieved, similar to how it already does if tracing is disabled.

Something like the following:
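The snippet originally attached here isn't reproduced above; as a rough sketch of the idea only (a hypothetical subclass for illustration, not necessarily what PR #1370 implements), the override could look like this:

import { Tracer } from '@aws-lambda-powertools/tracer';
import { Segment, Subsegment } from 'aws-xray-sdk-core';

class WarningTracer extends Tracer {
  // Same contract as Tracer.getSegment(), but a missing segment produces a warning
  // and a dummy subsegment instead of an error that fails the whole invocation.
  public getSegment(): Segment | Subsegment {
    const segment = this.provider.getSegment();
    if (segment === undefined) {
      console.warn(
        'Failed to get the current sub/segment from the context; returning a dummy subsegment.'
      );
      return new Subsegment('## Dummy segment (segment not found)');
    }
    return segment;
  }
}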
AWS Lambda Powertools for TypeScript version
1.6.0
AWS Lambda function runtime
18.x
Packaging format used
Lambda Layers
Execution logs