-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add agent self-observability. #716
Conversation
You need to fix CIs. |
And we need to update docs about agent self dashboard. |
docker compose scenarios all failed with |
I think docker compose version changed? |
.github/actions/run/action.yml
Outdated
- name: Install docker-compose | ||
shell: bash | ||
if: runner.os != 'Windows' | ||
run: | | ||
if ! command docker-compose 2>&1 > /dev/null; then | ||
echo "Installing docker-compose" | ||
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose | ||
sudo chmod +x /usr/local/bin/docker-compose | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kezhenxu94 Is this a good practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a good choice, we did this because our infra e2e uses the docker-compose in third party SDK (testcontainers), that we cannot easily fix. Please fix as #716 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will revert this change.
name: created_ignored_context_counter | ||
tags: | ||
- {name: created_by, value: propagated} | ||
singleValue: 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is counter value 1 stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this propagated context is created by new URL("http://localhost:8080/agent-so11y-scenario/case/propagated").getContent();
private static final Counter PROPAGATED_CONTEXT_COUNTER = MeterFactory | ||
.counter("created_tracing_context_counter").tag("created_by", "propagated").build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the get
method and lazy initialization. Because initializing through static field/block, could make some classes loaded before expected.
package org.apache.skywalking.apm.agent.core.so11y.bootstrap; | ||
|
||
public interface BootstrapPluginSO11Y { | ||
void recordInterceptorTimeCost(double timeCostInNanos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void recordInterceptorTimeCost(double timeCostInNanos); | |
void duration(double timeCostInNanos); |
public interface BootstrapPluginSO11Y { | ||
void recordInterceptorTimeCost(double timeCostInNanos); | ||
|
||
void recordInterceptorError(String pluginName, String interType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void recordInterceptorError(String pluginName, String interType); | |
void error(String pluginName, String interType); |
|
||
package org.apache.skywalking.apm.agent.core.so11y.bootstrap; | ||
|
||
public interface BootstrapPluginSO11Y { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public interface BootstrapPluginSO11Y { | |
public interface BootstrapPluginSo11y { |
* used by {@link BootstrapInterRuntimeAssist} | ||
*/ | ||
@SuppressWarnings("unused") | ||
public class BootstrapPluginSO11YBridge implements BootstrapPluginSO11Y { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public class BootstrapPluginSO11YBridge implements BootstrapPluginSO11Y { | |
public class BootstrapPluginSo11yBridge implements BootstrapPluginSO11Y { |
.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d)) | ||
.build(); | ||
|
||
// interceptor error counter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// interceptor error counter | |
// A map to cache meter obj(s) for plugins. The key is the plugin name. |
// interceptor error counter | ||
private static final Map<String, Counter> COUNTER_CACHE = new ConcurrentHashMap<>(); | ||
|
||
public static void recordTracingContextCreate(boolean forceSampling, boolean ignoredTracingContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void recordTracingContextCreate(boolean forceSampling, boolean ignoredTracingContext) { | |
public static void measureTracingContextCreation(boolean forceSampling, boolean ignoredTracingContext) { |
} | ||
} | ||
|
||
public static void recordTracingContextFinish(boolean ignoredTracingContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void recordTracingContextFinish(boolean ignoredTracingContext) { | |
public static void measureTracingContextCompletion(boolean ignoredTracingContext) { |
} | ||
} | ||
|
||
public static void recordLeakedTracingContext(boolean ignoredTracingContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void recordLeakedTracingContext(boolean ignoredTracingContext) { | |
public static void measureLeakedTracingContext(boolean ignoredTracingContext) { |
} | ||
} | ||
|
||
public static void recordInterceptorTimeCost(double timeCostInNanos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void recordInterceptorTimeCost(double timeCostInNanos) { | |
public static void durationOfInterceptor(double timeCostInNanos) { |
INTERCEPTOR_TIME_COST.addValue(timeCostInNanos / 1000000); | ||
} | ||
|
||
public static void recordInterceptorError(String pluginName, String interType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void recordInterceptorError(String pluginName, String interType) { | |
public static void errorOfPlugin(String pluginName, String interType) { |
private InstanceMethodsAroundInterceptorV2 interceptor; | ||
|
||
public InstMethodsInterV2(String instanceMethodsAroundInterceptorClassName, ClassLoader classLoader) { | ||
private static final String INTERCEPTOR_TYPE = "inst"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move all static final
field at the top.
@@ -70,12 +75,16 @@ public Object intercept(@This Object obj, @AllArguments Object[] allArguments, @ | |||
@Morph OverrideCallable zuper) throws Throwable { | |||
EnhancedInstance targetObject = (EnhancedInstance) obj; | |||
|
|||
long interceptorTimeCost = 0L; | |||
long beforeStartTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long beforeStartTime = System.nanoTime(); | |
long startTimeOfMethodBeforeInter = System.nanoTime(); |
Let's use this naming style. beforeStartTime
like before start time
, which is not our meaning.
Please change all |
I think we don't have a pull request for that? |
Yes, it will be submit this weekend. |
- name: "Plugin development guide" | ||
- name: "Agent Self Observability" | ||
path: "/en/setup/service-agent/java-agent/Java-Agent-self-observability" | ||
- name: "Plugin Development Guide" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add Plugin Test doc next to this? I think we missed it from menu.
@@ -0,0 +1,13 @@ | |||
# Java Agent self-observability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Java Agent self-observability | |
# Agent Self Observability |
docs/en/setup/service-agent/java-agent/Java-Agent-self-observability.md
Outdated
Show resolved
Hide resolved
if (INTERCEPTOR_TIME_COST == null) { | ||
INTERCEPTOR_TIME_COST = MeterFactory | ||
.histogram("tracing_context_performance") | ||
.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays.asList should be executed once. Please make them a static variable.
.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d)) | ||
.build(); | ||
} | ||
INTERCEPTOR_TIME_COST.addValue(timeCostInNanos / 1000000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics should be raw data, right? I don't think a method of interceptor would cost over 1 ms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have concern of long overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the value look like in your testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a group of time cost(nanos) in agent-so11y-scenario
, we should update histogram steps too, right?
4453.0
1.5740867E7
1875.0
3276344.0
1923509.0
2316475.0
932202.0
37576.0
1706292.0
1846.0
303985.0
126229.0
3075.0
203682.0
1919.0
29810.0
2636649.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think they should be aligned. Consider 4k ns, I think to keep ns as unit makes sense. Agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ns as unit make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the first buckets of histogram as 1000ns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the first buckets of histogram as 1000ns.
Is this step ok?
private static final List<Double> TIME_COST_HISTOGRAM_STEPS = Arrays.asList(
1000d, 10000d, 50000d, 100000d, 300000d, 500000d,
1000000d, 5000000d, 10000000d, 20000000d, 50000000d, 100000000d
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max is 100ms per intercepting. I think it is a very bad performance.
Let's do this anyway. I am not sure how bad an interceptor would be.
- `created_ignored_context_counter` and `finished_ignored_context_counter`. Same concepts like `*_tracing_context_counter`. | ||
- `interceptor_error_counter` - Counter. The number of errors happened in the interceptor logic, with `label=plugin_name, inter_type(constructor, inst, static)`. We don't add interceptor names into labels in case of OOM. The number of plugins is only dozens, it is predictable, but the number of interceptors will be hundreds. | ||
- `possible_leaked_context_counter` - Counter. The number of detected leaked contexts. It should include the `label=source(value=tracing, ignore)`. When `source=tracing`, it is today's shadow tracing context. But now, it is measured. | ||
- `tracing_context_performance` - Histogram. For successfully finished tracing context, it measures every interceptor's time cost(by using nanoseconds), the buckets of the histogram are {0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 5, 10, 20, 50, 100}ms. This provides the performance behavior for the tracing operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we should use ns as unit, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
CI fails, please fix. |
Add agent self-observability.
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
Tests(including UT, IT, E2E) are added to verify the new feature.
If it's UI related, attach the screenshots below.
If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #<12595> agent part.
Update the
CHANGES
log.