Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add agent self-observability. #716

Merged
merged 10 commits into from
Sep 15, 2024
Merged

Conversation

weixiang1862
Copy link
Member

Add agent self-observability.

  • If this is non-trivial feature, paste the links/URLs to the design doc.

  • Update the documentation to include this new feature.

  • Tests(including UT, IT, E2E) are added to verify the new feature.

  • If it's UI related, attach the screenshots below.

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #<12595> agent part.

  • Update the CHANGES log.

@wu-sheng wu-sheng added enhancement New feature or request core feature labels Sep 13, 2024
@wu-sheng wu-sheng added this to the 9.4.0 milestone Sep 13, 2024
@wu-sheng
Copy link
Member

You need to fix CIs.

@wu-sheng
Copy link
Member

And we need to update docs about agent self dashboard.

@weixiang1862
Copy link
Member Author

docker compose scenarios all failed with scenario.sh: line 35: docker-compose: command not found, how can i fix this?

@wu-sheng
Copy link
Member

I think docker compose version changed?

Comment on lines 50 to 58
- name: Install docker-compose
shell: bash
if: runner.os != 'Windows'
run: |
if ! command docker-compose 2>&1 > /dev/null; then
echo "Installing docker-compose"
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kezhenxu94 Is this a good practice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good choice, we did this because our infra e2e uses the docker-compose in third party SDK (testcontainers), that we cannot easily fix. Please fix as #716 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will revert this change.

name: created_ignored_context_counter
tags:
- {name: created_by, value: propagated}
singleValue: 1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is counter value 1 stable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this propagated context is created by new URL("http://localhost:8080/agent-so11y-scenario/case/propagated").getContent();

Comment on lines 34 to 35
private static final Counter PROPAGATED_CONTEXT_COUNTER = MeterFactory
.counter("created_tracing_context_counter").tag("created_by", "propagated").build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the get method and lazy initialization. Because initializing through static field/block, could make some classes loaded before expected.

package org.apache.skywalking.apm.agent.core.so11y.bootstrap;

public interface BootstrapPluginSO11Y {
void recordInterceptorTimeCost(double timeCostInNanos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void recordInterceptorTimeCost(double timeCostInNanos);
void duration(double timeCostInNanos);

public interface BootstrapPluginSO11Y {
void recordInterceptorTimeCost(double timeCostInNanos);

void recordInterceptorError(String pluginName, String interType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void recordInterceptorError(String pluginName, String interType);
void error(String pluginName, String interType);


package org.apache.skywalking.apm.agent.core.so11y.bootstrap;

public interface BootstrapPluginSO11Y {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public interface BootstrapPluginSO11Y {
public interface BootstrapPluginSo11y {

* used by {@link BootstrapInterRuntimeAssist}
*/
@SuppressWarnings("unused")
public class BootstrapPluginSO11YBridge implements BootstrapPluginSO11Y {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class BootstrapPluginSO11YBridge implements BootstrapPluginSO11Y {
public class BootstrapPluginSo11yBridge implements BootstrapPluginSO11Y {

.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d))
.build();

// interceptor error counter
Copy link
Member

@wu-sheng wu-sheng Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// interceptor error counter
// A map to cache meter obj(s) for plugins. The key is the plugin name.

// interceptor error counter
private static final Map<String, Counter> COUNTER_CACHE = new ConcurrentHashMap<>();

public static void recordTracingContextCreate(boolean forceSampling, boolean ignoredTracingContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static void recordTracingContextCreate(boolean forceSampling, boolean ignoredTracingContext) {
public static void measureTracingContextCreation(boolean forceSampling, boolean ignoredTracingContext) {

}
}

public static void recordTracingContextFinish(boolean ignoredTracingContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static void recordTracingContextFinish(boolean ignoredTracingContext) {
public static void measureTracingContextCompletion(boolean ignoredTracingContext) {

}
}

public static void recordLeakedTracingContext(boolean ignoredTracingContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static void recordLeakedTracingContext(boolean ignoredTracingContext) {
public static void measureLeakedTracingContext(boolean ignoredTracingContext) {

}
}

public static void recordInterceptorTimeCost(double timeCostInNanos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static void recordInterceptorTimeCost(double timeCostInNanos) {
public static void durationOfInterceptor(double timeCostInNanos) {

INTERCEPTOR_TIME_COST.addValue(timeCostInNanos / 1000000);
}

public static void recordInterceptorError(String pluginName, String interType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static void recordInterceptorError(String pluginName, String interType) {
public static void errorOfPlugin(String pluginName, String interType) {

private InstanceMethodsAroundInterceptorV2 interceptor;

public InstMethodsInterV2(String instanceMethodsAroundInterceptorClassName, ClassLoader classLoader) {
private static final String INTERCEPTOR_TYPE = "inst";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move all static final field at the top.

@@ -70,12 +75,16 @@ public Object intercept(@This Object obj, @AllArguments Object[] allArguments, @
@Morph OverrideCallable zuper) throws Throwable {
EnhancedInstance targetObject = (EnhancedInstance) obj;

long interceptorTimeCost = 0L;
long beforeStartTime = System.nanoTime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
long beforeStartTime = System.nanoTime();
long startTimeOfMethodBeforeInter = System.nanoTime();

Let's use this naming style. beforeStartTime like before start time, which is not our meaning.

@kezhenxu94
Copy link
Member

docker compose scenarios all failed with scenario.sh: line 35: docker-compose: command not found, how can i fix this?

Please change all docker-compose command in scenario.sh to docker compose

@wu-sheng
Copy link
Member

https://skywalking.apache.org/docs/main/next/en/setup/backend/dashboards-so11y-java-agent/ → Status: 404

I think we don't have a pull request for that?

@weixiang1862
Copy link
Member Author

https://skywalking.apache.org/docs/main/next/en/setup/backend/dashboards-so11y-java-agent/ → Status: 404

I think we don't have a pull request for that?

Yes, it will be submit this weekend.

- name: "Plugin development guide"
- name: "Agent Self Observability"
path: "/en/setup/service-agent/java-agent/Java-Agent-self-observability"
- name: "Plugin Development Guide"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add Plugin Test doc next to this? I think we missed it from menu.

@@ -0,0 +1,13 @@
# Java Agent self-observability
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Java Agent self-observability
# Agent Self Observability

if (INTERCEPTOR_TIME_COST == null) {
INTERCEPTOR_TIME_COST = MeterFactory
.histogram("tracing_context_performance")
.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays.asList should be executed once. Please make them a static variable.

.steps(Arrays.asList(0.01d, 0.1d, 0.5d, 1d, 3d, 5d, 10d, 50d, 100d, 200d, 500d, 1000d))
.build();
}
INTERCEPTOR_TIME_COST.addValue(timeCostInNanos / 1000000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics should be raw data, right? I don't think a method of interceptor would cost over 1 ms?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have concern of long overflow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the value look like in your testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a group of time cost(nanos) in agent-so11y-scenario, we should update histogram steps too, right?

4453.0 
1.5740867E7 
1875.0 
3276344.0 
1923509.0 
2316475.0 
932202.0 
37576.0 
1706292.0 
1846.0 
303985.0 
126229.0 
3075.0 
203682.0 
1919.0 
29810.0 
2636649.0 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think they should be aligned. Consider 4k ns, I think to keep ns as unit makes sense. Agree?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ns as unit make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the first buckets of histogram as 1000ns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the first buckets of histogram as 1000ns.

Is this step ok?

private static final List<Double> TIME_COST_HISTOGRAM_STEPS = Arrays.asList(
    1000d, 10000d, 50000d, 100000d, 300000d, 500000d,
    1000000d, 5000000d, 10000000d, 20000000d, 50000000d, 100000000d
);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max is 100ms per intercepting. I think it is a very bad performance.

Let's do this anyway. I am not sure how bad an interceptor would be.

- `created_ignored_context_counter` and `finished_ignored_context_counter`. Same concepts like `*_tracing_context_counter`.
- `interceptor_error_counter` - Counter. The number of errors happened in the interceptor logic, with `label=plugin_name, inter_type(constructor, inst, static)`. We don't add interceptor names into labels in case of OOM. The number of plugins is only dozens, it is predictable, but the number of interceptors will be hundreds.
- `possible_leaked_context_counter` - Counter. The number of detected leaked contexts. It should include the `label=source(value=tracing, ignore)`. When `source=tracing`, it is today's shadow tracing context. But now, it is measured.
- `tracing_context_performance` - Histogram. For successfully finished tracing context, it measures every interceptor's time cost(by using nanoseconds), the buckets of the histogram are {0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 5, 10, 20, 50, 100}ms. This provides the performance behavior for the tracing operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we should use ns as unit, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@wu-sheng
Copy link
Member

CI fails, please fix.

@wu-sheng wu-sheng merged commit 7536424 into apache:main Sep 15, 2024
191 of 192 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core enhancement New feature or request feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants