[APR-190] Change the default metric compression kind to be zstd. #32087

Open
wants to merge 12 commits into
base: main

Conversation

StephenWakely
Contributor

@StephenWakely StephenWakely commented Dec 12, 2024

This changes the default compression kind for metrics from zlib to zstd.

The setting can be reverted in the configuration with:

serializer_compressor_kind: zlib

What does this PR do?

Motivation

zstd produces a significantly better compression ratio than zlib.

Describe how you validated your changes

To validate, make sure the agent will send metric payloads compressed with zstd by default.

  • Manual check: route an Agent's traffic through a proxy or traffic dumper and confirm that it sends the "Content-Encoding: zstd" header together with zstd-compressed payloads.
  • General validation: validate that metrics reach the intake and are visible in graphs.
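
For the manual check, a small helper can be applied to each captured request. This is a minimal sketch, not part of the PR: `looksLikeZstd` is a hypothetical name, and the only facts it relies on are the `Content-Encoding: zstd` header mentioned above and the four-byte magic number that opens every standard zstd frame.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// zstdMagic is the magic number that starts a standard zstd frame
// (0xFD2FB528 stored little-endian).
var zstdMagic = []byte{0x28, 0xB5, 0x2F, 0xFD}

// looksLikeZstd is a hypothetical helper for inspecting a captured request.
// It reports whether the Content-Encoding header says "zstd" and the body
// starts with the zstd frame magic number.
func looksLikeZstd(contentEncoding string, body []byte) bool {
	headerOK := strings.EqualFold(strings.TrimSpace(contentEncoding), "zstd")
	return headerOK && bytes.HasPrefix(body, zstdMagic)
}

func main() {
	// Illustrative bytes only: a zstd magic number followed by one filler byte,
	// and the two bytes that commonly open a zlib stream.
	zstdBody := []byte{0x28, 0xB5, 0x2F, 0xFD, 0x00}
	zlibBody := []byte{0x78, 0x9C}

	fmt.Println(looksLikeZstd("zstd", zstdBody))
	fmt.Println(looksLikeZstd("deflate", zlibBody))
}
```

Checking the body as well as the header catches the case where the header is set but the payload is still produced by the old compressor.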

Possible Drawbacks / Trade-offs

zstd does use more memory than zlib.

Additional Notes

The default compression level is set to 1, the lowest available, but it can be reconfigured with:

serializer_zstd_compressor_level: 1

Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
Contributor

@remeh remeh left a comment

Please can you write the instructions on how to revert to zlib in the CHANGELOG entry? This way it will end up in the public CHANGELOG 👍

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Dec 12, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv aws.create-vm --pipeline-id=51428771 --os-family=ubuntu

Note: This applies to commit a82e242

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Dec 12, 2024

Package size comparison

Comparison with ancestor 97fdc4dd76a239705a17f1d71b73eb43681f2304

Diff per package
package diff status size ancestor threshold
datadog-agent-amd64-deb 0.01MB ⚠️ 1271.60MB 1271.59MB 140.00MB
datadog-iot-agent-amd64-deb 0.00MB 113.29MB 113.29MB 10.00MB
datadog-dogstatsd-amd64-deb 0.00MB 78.41MB 78.41MB 10.00MB
datadog-heroku-agent-amd64-deb -0.00MB 526.66MB 526.66MB 70.00MB
datadog-agent-x86_64-rpm 0.01MB ⚠️ 1280.83MB 1280.83MB 140.00MB
datadog-agent-x86_64-suse 0.01MB ⚠️ 1280.83MB 1280.83MB 140.00MB
datadog-iot-agent-x86_64-rpm 0.00MB 113.36MB 113.36MB 10.00MB
datadog-iot-agent-x86_64-suse 0.00MB 113.36MB 113.36MB 10.00MB
datadog-dogstatsd-x86_64-rpm 0.00MB 78.49MB 78.49MB 10.00MB
datadog-dogstatsd-x86_64-suse 0.00MB 78.49MB 78.49MB 10.00MB
datadog-agent-arm64-deb 0.01MB ⚠️ 1005.68MB 1005.67MB 140.00MB
datadog-iot-agent-arm64-deb 0.00MB 108.77MB 108.77MB 10.00MB
datadog-dogstatsd-arm64-deb 0.00MB 55.65MB 55.65MB 10.00MB
datadog-agent-aarch64-rpm 0.01MB ⚠️ 1014.89MB 1014.89MB 140.00MB
datadog-iot-agent-aarch64-rpm 0.00MB 108.84MB 108.84MB 10.00MB

Decision

⚠️ Warning


cit-pr-commenter bot commented Dec 12, 2024

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 3e0414a2-9b55-4797-84c5-8046b49a03a7

Baseline: 97fdc4d
Comparison: 6c37b20
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
uds_dogstatsd_to_api_cpu % cpu utilization +3.71 [+3.39, +4.03] 1 Logs
quality_gate_idle_all_features memory utilization +0.93 [+0.88, +0.98] 1 Logs bounds checks dashboard
file_to_blackhole_100ms_latency egress throughput +0.08 [-0.26, +0.41] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.04 [-0.32, +0.41] 1 Logs
file_to_blackhole_500ms_latency egress throughput +0.04 [-0.30, +0.38] 1 Logs
file_to_blackhole_0ms_latency_http2 egress throughput +0.03 [-0.35, +0.41] 1 Logs
otel_to_otel_logs ingress throughput +0.02 [-0.29, +0.33] 1 Logs
file_to_blackhole_300ms_latency egress throughput +0.00 [-0.28, +0.29] 1 Logs
uds_dogstatsd_to_api ingress throughput +0.00 [-0.04, +0.05] 1 Logs
file_tree memory utilization +0.00 [-0.05, +0.05] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput +0.00 [-0.01, +0.01] 1 Logs
file_to_blackhole_0ms_latency_http1 egress throughput -0.00 [-0.39, +0.38] 1 Logs
quality_gate_idle memory utilization -0.04 [-0.06, -0.02] 1 Logs bounds checks dashboard
file_to_blackhole_1000ms_latency_linear_load egress throughput -0.06 [-0.27, +0.15] 1 Logs
quality_gate_logs % cpu utilization -0.23 [-1.54, +1.09] 1 Logs
tcp_syslog_to_blackhole ingress throughput -0.40 [-0.42, -0.37] 1 Logs
file_to_blackhole_1000ms_latency egress throughput -0.52 [-0.87, -0.17] 1 Logs

Bounds Checks: ❌ Failed

perf experiment bounds_check_name replicates_passed links
file_to_blackhole_0ms_latency lost_bytes 48/50
file_to_blackhole_0ms_latency_http1 lost_bytes 49/50
file_to_blackhole_0ms_latency_http2 lost_bytes 49/50
file_to_blackhole_100ms_latency lost_bytes 49/50
file_to_blackhole_500ms_latency lost_bytes 49/50
file_to_blackhole_0ms_latency memory_usage 50/50
file_to_blackhole_0ms_latency_http1 memory_usage 50/50
file_to_blackhole_0ms_latency_http2 memory_usage 50/50
file_to_blackhole_1000ms_latency memory_usage 50/50
file_to_blackhole_1000ms_latency_linear_load memory_usage 50/50
file_to_blackhole_100ms_latency memory_usage 50/50
file_to_blackhole_300ms_latency lost_bytes 50/50
file_to_blackhole_300ms_latency memory_usage 50/50
file_to_blackhole_500ms_latency memory_usage 50/50
quality_gate_idle memory_usage 50/50 bounds checks dashboard
quality_gate_idle_all_features memory_usage 50/50 bounds checks dashboard
quality_gate_logs lost_bytes 50/50
quality_gate_logs memory_usage 50/50

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
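
The three criteria above can be sketched as a small predicate. This is an illustration only, not the detector's actual code: the `Experiment` type and `isRegression` function are hypothetical names, and the sample values are taken from the `uds_dogstatsd_to_api_cpu` row of the table above.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values reported per experiment in the table.
// The type and its field names are hypothetical.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower end of the Δ mean % CI
	CIHigh       float64 // upper end of the Δ mean % CI
	Erratic      bool    // whether the configuration marks it "erratic"
}

// isRegression applies the three criteria: the effect is at least as large
// as the tolerance, the confidence interval excludes zero, and the
// experiment is not marked erratic.
func isRegression(e Experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= tolerancePct
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	cpu := Experiment{
		Name:         "uds_dogstatsd_to_api_cpu",
		DeltaMeanPct: 3.71,
		CILow:        3.39,
		CIHigh:       4.03,
	}
	// The interval excludes zero, but 3.71 is below the 5.00 tolerance.
	fmt.Printf("%s regression=%t\n", cpu.Name, isRegression(cpu, 5.00))
	// prints "uds_dogstatsd_to_api_cpu regression=false"
}
```

Under these rules the CPU increase is statistically distinguishable from zero but too small to be flagged, which matches the "No significant changes detected" summary.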

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check memory_usage: 50/50 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 50/50 replicas passed. Gate passed.
  • quality_gate_logs, bounds check lost_bytes: 50/50 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 50/50 replicas passed. Gate passed.

Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
@StephenWakely StephenWakely requested a review from a team as a code owner December 12, 2024 12:46
StephenWakely and others added 2 commits December 13, 2024 10:51
Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Dec 13, 2024

Gitlab CI Configuration Changes

Modified Jobs

single-machine-performance-regression_detector
  single-machine-performance-regression_detector:
    allow_failure: false
    artifacts:
      expire_in: 1 weeks
      paths:
      - submission_metadata
      - ${CI_COMMIT_SHA}-baseline_sha
      - outputs/report.md
      - outputs/regression_signal.json
      - outputs/bounds_check_signal.json
      - outputs/junit.xml
      - outputs/report.json
      - outputs/decision_record.md
      when: always
    image: registry.ddbuild.io/ci/datadog-agent-buildimages/docker_x64$DATADOG_AGENT_BUILDIMAGES_SUFFIX:$DATADOG_AGENT_BUILDIMAGES
    needs:
    - artifacts: false
      job: single_machine_performance-amd64-a7
    rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: never
    - if: $CI_COMMIT_BRANCH =~ /^[0-9]+\.[0-9]+\.x$/
      when: never
    - if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
      when: never
    - when: on_success
    script:
    - DATADOG_API_KEY="$("$CI_PROJECT_DIR"/tools/ci/fetch_secret.sh "$AGENT_API_KEY_ORG2"
      token)" || exit $?; export DATADOG_API_KEY
    - datadog-ci tag --level job --tags smp_failure_mode:"unknown"
    - mkdir outputs
    - git fetch origin
    - SMP_BASE_BRANCH=$(inv release.get-release-json-value base_branch --no-worktree)
    - echo "Looking for merge base for branch ${SMP_BASE_BRANCH}"
    - SMP_MERGE_BASE=$(git merge-base ${CI_COMMIT_SHA} origin/${SMP_BASE_BRANCH})
    - echo "Merge base is ${SMP_MERGE_BASE}"
    - AWS_NAMED_PROFILE="single-machine-performance"
    - SMP_ACCOUNT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $SMP_ACCOUNT account_id)
      || exit $?
    - SMP_ECR_URL=${SMP_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com
    - SMP_AGENT_TEAM_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $SMP_ACCOUNT agent_team_id)
      || exit $?
    - SMP_API=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $SMP_ACCOUNT api_url) || exit
      $?
    - SMP_BOT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $SMP_ACCOUNT bot_login)
      || exit $?
    - SMP_BOT_KEY=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $SMP_ACCOUNT bot_token)
      || exit $?
    - aws configure set aws_access_key_id "$SMP_BOT_ID" --profile ${AWS_NAMED_PROFILE}
    - aws configure set aws_secret_access_key "$SMP_BOT_KEY" --profile ${AWS_NAMED_PROFILE}
    - aws configure set region us-west-2 --profile ${AWS_NAMED_PROFILE}
    - aws --profile single-machine-performance s3 cp s3://smp-cli-releases/v${SMP_VERSION}/x86_64-unknown-linux-gnu/smp
      smp
    - chmod +x smp
    - BASELINE_SHA="${SMP_MERGE_BASE}"
    - echo "Computing baseline..."
    - echo "Checking if commit ${BASELINE_SHA} is recent enough..."
    - FOUR_DAYS_BEFORE_NOW=$(date --date="-4 days +1 hour" "+%s")
    - BASELINE_COMMIT_TIME=$(git -c log.showSignature=false show --no-patch --format=%ct
      ${BASELINE_SHA})
    - "if [[ ${BASELINE_COMMIT_TIME} -le ${FOUR_DAYS_BEFORE_NOW} ]]\nthen\n    echo\
      \ \"ERROR: Merge-base of this branch is too old for SMP. Please update your branch\
      \ by merging an up-to-date main branch into your branch or by rebasing it on an\
      \ up-to-date main branch.\"\n    datadog-ci tag --level job --tags smp_failure_mode:\"\
      merge-base-too-old\"\n    exit 1\nfi\n"
    - echo "Commit ${BASELINE_SHA} is recent enough"
    - echo "Checking if image exists for commit ${BASELINE_SHA}..."
    - "while [[ ! $(aws ecr describe-images --region us-west-2 --profile single-machine-performance\
      \ --registry-id \"${SMP_ACCOUNT_ID}\" --repository-name \"${SMP_AGENT_TEAM_ID}-agent\"\
      \ --image-ids imageTag=\"${BASELINE_SHA}-7-amd64\") ]]\ndo\n    echo \"No image\
      \ exists for ${BASELINE_SHA} - checking predecessor of ${BASELINE_SHA} next\"\n\
      \    BASELINE_SHA=$(git rev-parse ${BASELINE_SHA}^)\n    echo \"Checking if commit\
      \ ${BASELINE_SHA} is recent enough...\"\n    BASELINE_COMMIT_TIME=$(git -c log.showSignature=false\
      \ show --no-patch --format=%ct ${BASELINE_SHA})\n    if [[ ${BASELINE_COMMIT_TIME}\
      \ -le ${FOUR_DAYS_BEFORE_NOW} ]]\n    then\n        echo \"ERROR: Merge-base of\
      \ this branch is too old for SMP. Please update your branch by merging an up-to-date\
      \ main branch into your branch or by rebasing it on an up-to-date main branch.\"\
      \n        datadog-ci tag --level job --tags smp_failure_mode:\"merge-base-too-old-predecessor\"\
      \n        exit 1\n    fi\n    echo \"Commit ${BASELINE_SHA} is recent enough\"\
      \n    echo \"Checking if image exists for commit ${BASELINE_SHA}...\"\ndone\n"
    - echo "Image exists for commit ${BASELINE_SHA}"
    - echo "Baseline SHA is ${BASELINE_SHA}"
    - echo -n "${BASELINE_SHA}" > "${CI_COMMIT_SHA}-baseline_sha"
    - aws s3 cp --profile single-machine-performance --only-show-errors "${CI_COMMIT_SHA}-baseline_sha"
      "s3://${SMP_AGENT_TEAM_ID}-smp-artifacts/information/"
    - BASELINE_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${BASELINE_SHA}-7-amd64
    - echo "${BASELINE_SHA} | ${BASELINE_IMAGE}"
    - COMPARISON_IMAGE=${SMP_ECR_URL}/${SMP_AGENT_TEAM_ID}-agent:${CI_COMMIT_SHA}-7-amd64
    - echo "${CI_COMMIT_SHA} | ${COMPARISON_IMAGE}"
    - SMP_TAGS="ci_pipeline_id=${CI_PIPELINE_ID},ci_job_id=${CI_JOB_ID}"
    - echo "Tags passed through SMP are ${SMP_TAGS}"
    - RUST_LOG="info,aws_config::profile::credentials=error"
    - RUST_LOG_DEBUG="debug,aws_config::profile::credentials=error"
    - "RUST_LOG=\"${RUST_LOG}\" ./smp --team-id ${SMP_AGENT_TEAM_ID} --api-base ${SMP_API}\
      \ --aws-named-profile ${AWS_NAMED_PROFILE} \\\njob submit \\\n--baseline-image\
      \ ${BASELINE_IMAGE} \\\n--comparison-image ${COMPARISON_IMAGE} \\\n--baseline-sha\
      \ ${BASELINE_SHA} \\\n--comparison-sha ${CI_COMMIT_SHA} \\\n--target-config-dir\
-     \ test/regression/ \\\n--submission-metadata submission_metadata \\\n--tags ${SMP_TAGS}\
?                                                                              ^ ^^^^^^^^^^^^
+     \ test/regression/ \\\n--submission-metadata submission_metadata \\\n--total-samples\
?                                                                             ++ ^^ ^^^^^^
-     \ || {\n  exit_code=$?\n  echo \"smp job submit command failed with code $exit_code\"\
-     \n  datadog-ci tag --level job --tags smp_failure_mode:\"job-submission\"\n  exit\
-     \ $exit_code\n}\n"
+     \ 1800\n--replicas 50 \\\n--tags ${SMP_TAGS} || {\n  exit_code=$?\n  echo \"smp\
+     \ job submit command failed with code $exit_code\"\n  datadog-ci tag --level job\
+     \ --tags smp_failure_mode:\"job-submission\"\n  exit $exit_code\n}\n"
    - SMP_JOB_ID=$(jq -r '.jobId' submission_metadata)
    - echo "SMP Job Id is ${SMP_JOB_ID}"
    - datadog-ci tag --level job --tags smp_job_id:${SMP_JOB_ID}
    - "RUST_LOG=\"${RUST_LOG}\" ./smp --team-id ${SMP_AGENT_TEAM_ID} --api-base ${SMP_API}\
      \ --aws-named-profile ${AWS_NAMED_PROFILE} \\\njob status \\\n--wait \\\n--wait-delay-seconds\
      \ 60 \\\n--submission-metadata submission_metadata || {\n  exit_code=$?\n  echo\
      \ \"smp job status command failed with code $exit_code\"\n  datadog-ci tag --level\
      \ job --tags smp_failure_mode:\"job-status\"\n  exit $exit_code\n}\n"
    - "RUST_LOG=\"${RUST_LOG}\" ./smp --team-id ${SMP_AGENT_TEAM_ID} --api-base ${SMP_API}\
      \ --aws-named-profile ${AWS_NAMED_PROFILE} \\\njob sync \\\n--submission-metadata\
      \ submission_metadata \\\n--output-path outputs || {\n  exit_code=$?\n  echo \"\
      smp job sync command failed with code $exit_code\"\n  datadog-ci tag --level job\
      \ --tags smp_failure_mode:\"job-sync\"\n  exit $exit_code\n}\n"
    - cat outputs/report.md | sed "s/^\$/$(echo -ne '\uFEFF\u00A0\u200B')/g"
    - datadog-ci junit upload --service datadog-agent outputs/junit.xml
    - datadog-ci tag --level job --tags smp_failure_mode:"none"
    - datadog-ci tag --level job --tags smp_optimization_goal:"passed"
    - "RUST_LOG=\"${RUST_LOG}\" ./smp --team-id ${SMP_AGENT_TEAM_ID} --api-base ${SMP_API}\
      \ --aws-named-profile ${AWS_NAMED_PROFILE} \\\n  job result \\\n  --submission-metadata\
      \ submission_metadata --signal regression-detector || {\n  exit_code=$?\n  echo\
      \ \"smp regression detector has detected a regression\"\n  datadog-ci tag --level\
      \ job --tags smp_optimization_goal:\"failed\"\n}\n"
    - datadog-ci tag --level job --tags smp_bounds_check:"passed"
    - "RUST_LOG=\"${RUST_LOG}\" ./smp --team-id ${SMP_AGENT_TEAM_ID} --api-base ${SMP_API}\
      \ --aws-named-profile ${AWS_NAMED_PROFILE} \\\n  job result \\\n  --submission-metadata\
      \ submission_metadata --signal bounds-check || {\n  exit_code=$?\n  echo \"smp\
      \ regression detector has detected a failed bounds check\"\n  datadog-ci tag --level\
      \ job --tags smp_bounds_check:\"failed\"\n}\n"
    - datadog-ci tag --level job --tags smp_quality_gates:"failed"
    - "python3 <<'EOF'\nimport json\nimport sys\n\ntry:\n    with open('outputs/report.json')\
      \ as f:\n        data = json.load(f)\nexcept FileNotFoundError:\n    print(\"\
      Machine readable report not found.\")\n    sys.exit(1)\nexcept json.JSONDecodeError\
      \ as e:\n    print(f\"Error parsing JSON report: {e}\")\n    sys.exit(1)\n\nexperiments\
      \ = data.get('experiments', {})\nfailed = False\ndecision_record = []\n\nfor exp_name,\
      \ exp_data in experiments.items():\n    if exp_name.startswith('quality_gate_'):\n\
      \        bounds_checks = exp_data.get('bounds_checks', {})\n        for check_name,\
      \ check_data in bounds_checks.items():\n            results = check_data.get('results',\
      \ {})\n            comparison = results.get('comparison', [])\n            num_total\
      \ = len(comparison)\n            failed_replicates = [\n                replicate\
      \ for replicate in comparison if not replicate.get('passed', False)\n        \
      \    ]\n            num_failed = len(failed_replicates)\n            num_passed\
      \ = num_total - num_failed\n            if failed_replicates:\n              \
      \  decision_record.append(\n                    f\"- **{exp_name}**, bounds check\
      \ **{check_name}**: {num_passed}/{num_total} replicas passed. Failed {num_failed}\
      \ which is > 0. Gate **FAILED**.\"\n                )\n                failed\
      \ = True\n            else:\n                decision_record.append(\n       \
      \             f\"- **{exp_name}**, bounds check **{check_name}**: {num_passed}/{num_total}\
      \ replicas passed. Gate passed.\"\n                )\n\nwith open('outputs/decision_record.md',\
      \ 'w') as f:\n    # Extra newline since this is appended to another report\n \
      \   f.write('\\n\\n## CI Pass/Fail Decision\\n\\n')\n    if failed:\n        f.write('\u274C\
      \ **Failed.** Some Quality Gates were violated.\\n\\n')\n        f.write('\\n'.join(decision_record))\n\
      \    else:\n        f.write('\u2705 **Passed.** All Quality Gates passed.\\n\\\
      n')\n        f.write('\\n'.join(decision_record))\n\nif failed:\n    print(\"\
      Quality gate failed, see decision record\")\n    sys.exit(1)\nelse:\n    print(\"\
      Quality gate passed.\")\n    sys.exit(0)\nEOF\n"
    - datadog-ci tag --level job --tags smp_quality_gates:"passed"
    stage: functional_test
    tags:
    - arch:amd64
    timeout: 1h10m
    variables:
      SMP_VERSION: 0.19.3

Changes Summary

Removed Modified Added Renamed
0 1 0 0

ℹ️ Diff available in the job log.

@louis-cqrl louis-cqrl removed the request for review from misteriaud December 13, 2024 10:50
@github-actions github-actions bot added long review PR is complex, plan time to review it and removed short review PR is simple enough to be reviewed quickly labels Dec 13, 2024
blt and others added 2 commits December 13, 2024 16:36
Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
@StephenWakely StephenWakely requested a review from a team as a code owner December 16, 2024 12:15
@@ -126,7 +126,7 @@ func TestPayloadsEmptyServiceCheck(t *testing.T) {

func TestPayloadsServiceChecks(t *testing.T) {
config := mock.New(t)
- config.Set("serializer_max_payload_size", 200, pkgconfigmodel.SourceAgentRuntime)
+ config.Set("serializer_max_payload_size", 250, pkgconfigmodel.SourceAgentRuntime)
Contributor Author

The maximum compress bound for zstd is larger than for zlib, so I had to increase this size to get the test to pass. It does indicate that a side effect of moving to zstd will be fewer events per payload.

When looking at the actual payload sizes in this test we have:

With Zstd:
compressed size: 108
compressed size: 108
compressed size: 92

With Zlib:
compressed size: 116
compressed size: 113
compressed size: 96
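
To illustrate the difference in compress bounds, here is a minimal sketch that evaluates the worst-case size formulas published by the reference libraries: zlib's `compressBound` and the `ZSTD_COMPRESSBOUND` macro from `zstd.h`. It assumes the Agent's compressors follow those formulas, which this PR does not confirm; the function names are hypothetical.

```go
package main

import "fmt"

// zlibBound mirrors the formula used by zlib's compressBound().
func zlibBound(n int) int {
	return n + (n >> 12) + (n >> 14) + (n >> 25) + 13
}

// zstdBound mirrors the ZSTD_COMPRESSBOUND macro from zstd.h: inputs smaller
// than 128 KiB get an extra margin that shrinks as the input grows.
func zstdBound(n int) int {
	const lowLimit = 128 << 10 // 128 KiB
	margin := 0
	if n < lowLimit {
		margin = (lowLimit - n) >> 11
	}
	return n + (n >> 8) + margin
}

func main() {
	for _, n := range []int{200, 250, 4096} {
		fmt.Printf("input=%d zlib bound=%d zstd bound=%d\n", n, zlibBound(n), zstdBound(n))
	}
}
```

For a 200-byte input these formulas give 213 bytes for zlib and 263 bytes for zstd. The extra 50 bytes of worst-case headroom is in line with the test limit moving from 200 to 250, even though the actual compressed sizes are smaller with zstd.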

@StephenWakely
Contributor Author

There was some surprise that the uds_dogstatsd_to_api regression test showed no improvement in network traffic. Digging into it further, I captured the traffic the agent emits when its input comes from lading. The events look similar to this, with a lot of random-looking text:

{
        "msg_title": "MPjdjQcusqqmjtULPy9kDz7Cb1Sur3MFdrAMonCsWHxN3C6jSSIvoF63xk1nOfzTpgi8NiUgt",
        "msg_text": "A3hL7MlmE72HY2wbtKtZqDcw5XBIsRtOLRWN0u1RRniQniAJOKtf207Pn6VPCnqro7wZsay5SDGTuPbNy3daPbCA6yJ7Hzi5EzxVrIbRrVpCIUwGK2LjBlldYOVI7RdX4DfeCKrIUzczUKIHrqZIb9gJNk7mBMA7urdPtOgWtxU44gQF8z9pzOI8kjcSmsYOzZthJdOvWT4HaLIL7pc7ijD7CtuoaZx8W1uQQEmj98MQUvK3IXENRmzw9AcPEelHm87eUw5BhUR0qEX0ZLS7uc0yfoHxLoeCBgHWDj4W1so5U1PaFmGbAV2ZQhRzYf9nZbkEBwagyasqDI0nFCQn8O3v63cGna5bXrcioYr4kZA2m5sfreUxZO9PrYkO17nIf7yNiHnXsu1qlyN4PrdPLXfGbI3K8SG0W0Ew50IMLGfkAxFkV8MnrHLBvtFhn0qM6SPA2zm4PtZzk3pbV9kLfDyFv2kHNQfNIzK7nEgW41u3XQ2JLioDOoQmX0aBeXHXJ6Awi0pZq4XkhpQipgc1ltqXFmNJhCUGqfSwcsDgKgG8aUW4nVCZ5QTHbx4ZneNDrkzG0rxDSqEWiH58Omn6NhVevo67uwiGIqSVfrImlnwxd6jifd637GbXVi6xTAJzlof8CLPP",
        "timestamp": 3703798762,
        "priority": "normal",
        "host": "stephenwakely-Precision-5570",
        "tags": [
          "6QPCa75XGUSbmSIoS8nobBz54odGA6crqdqZ4bq05iLxIxf9K6SV017ywfcoR3XmmceaLZbv1NC0vZBokOSBEqTFMnB6oNlDO7ORzXsmaggGfOkw1:XvWHu0HtSj0JqV93tcxRvfdWIJawSZm",
          "75O9egotFhqFxxBWV0usmkFIZG9WZMEcbUBGHlXPUZo9FPvBrCJsZLUMFfTm:ROUgIfSElkGjJGYWjQTKw6FGTk6",
          "7Z4M19YAxc8UoNlwEbUm8azP9z5piO43FmHx9oDA4YWHpX:3ouIbsdvpDBhaavfEumkoyEpxK8ZoCCkCkLS",
          "7x1SDzYOREyMFdgNtNV:lMBHf",
          "9vpVBfyJ6YI5cCn6gPTxLoW17nQimuEdZueogrmz5UVPspZlzZfXYAOFUKzHGzP1qg9CJ3l9:Ix2AJXUGL6M9ATLjY14GpUoB1LCfPYiNao3VWp3isrrFgLNmT9ZG",
          "9xYkgAiFVFsoTR7KZwUCXuw80th6RnjBRi2xIh9bhyhYu5V4uOY3YnexLd5snXg3IQ5HtBB6ugjqq8LMwIK0M9JjkczpK4JFJb:ccaAbfZjiPH",
          "Es8ReCS0qpEajpCmtbe:PF1VNVj9YbqKPIy7mndqPX00F8zY",
          "FUtLhb:elzioLrdRL4Ft9csH463Z67HanbxbvvL4oxVdf1",
          "IJd67ElZBs4VrUhNCpNPVFXb6o1tDlXl9l8IFNWFykKGZ7UXlRbQAw6WC6SsoX5N:H1Tg9TeIwJXSCf4wL7wY9qQ9CJhUFVHwi6j0WJr2kjKbSSK3uvoREybZ",
          "LUzaNnVVaDdkwJVHNrVQqAuf8O8rr9E:hFASuKMaN38tClAzai2otBOWp94On",
          "MkEBE7xRrV42F8gGKH4jNCRHXZD1tprH9AspaNZboA73jIK00vS2Agz:fkgUEuJsxGcyImFzAOLQ09qDMhw",
          "Ov8wgLRq7:NA",
          "WtgFVnpidx3Rv1YF5tSWaFO:4NyvzDnBwv60mbYX3WbIR",
          "X3:4ODg32MNFUtWThw2tD3piqfwB9WgrKOZlqpNxpz9A6iox6",
          "YfPgCBKz189x:ECwu5OaLrnSupwkCbzNiEf8O43X6WHfVs",
          "ZMPdlSy17FJEbqtCyGCIFiCBmzOdZR3l7QnKJ5gspjQv:ris8BpWOBAuwh56nuEsjtfmWG6",
          "j:t8XpBdUDMpGU9aWzv1srfyvhMcVSCluP6LhbEKsVa3cCALh0DSSLVdq98RlbqaN33ZK5BayADgUYuJhbLjnhquHxU3QU82iayaNgWcWi13MoMVY5quGfC",
          "la:JOVp4s9I7l",
          "ulcOR6rrkXi6RQ9kF:KpAQEVaIeG1lTPCMwUBT19BShAl1N2FUFp1"
        ],
        "alert_type": "warning",
        "source_type_name": "sbYA4xM"
      }

Dumping this to a file and compressing those files produces:

$ ls -l payload38*
.rw-r----- 3.7M stephenwakely 16 Dec 15:08 payload38.uncompressed
.rw-r----- 2.5M stephenwakely 16 Dec 15:26 payload38.zlib
.rw-r----- 2.0M stephenwakely 16 Dec 15:24 payload38.zstd5
.rw-r----- 2.5M stephenwakely 16 Dec 15:27 payload38.zstd1

There is no difference in file size between zlib and zstd1 (zstd at level 1).

It is worth noting that there is a significant difference at level 5.

I would posit that the lack of improvement here is due to the high randomness of the data and more uniform data would produce improvements. I have observed this in production environments.
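
The effect of randomness on compression ratio can be reproduced in a small experiment. This is a sketch under stated assumptions: it uses only `compress/zlib`, because Go's standard library has no zstd package, and the generated inputs (`randomText`, `repetitiveText`) are hypothetical stand-ins for the lading payloads and for more uniform production data.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"math/rand"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// randomText returns n pseudo-random alphanumeric bytes, similar in spirit
// to the random-looking strings in the captured events.
func randomText(n int, seed int64) []byte {
	r := rand.New(rand.NewSource(seed))
	out := make([]byte, n)
	for i := range out {
		out[i] = alphabet[r.Intn(len(alphabet))]
	}
	return out
}

// repetitiveText returns roughly n bytes of one metric-like line repeated,
// standing in for more uniform data.
func repetitiveText(n int) []byte {
	line := []byte("system.cpu.user:42|g|#env:prod,service:web\n")
	return bytes.Repeat(line, n/len(line)+1)
}

// zlibRatio returns compressed size divided by original size.
func zlibRatio(data []byte) float64 {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	// Writes to a bytes.Buffer do not fail, so any error here is unexpected.
	if _, err := w.Write(data); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return float64(buf.Len()) / float64(len(data))
}

func main() {
	fmt.Printf("random text ratio:     %.2f\n", zlibRatio(randomText(100000, 1)))
	fmt.Printf("repetitive text ratio: %.2f\n", zlibRatio(repetitiveText(100000)))
}
```

Random alphanumeric text leaves a compressor little beyond the reduced alphabet to exploit, so the ratio stays high whichever algorithm is used, while repeated structure compresses to a small fraction of its size.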

Comment on lines 116 to 117
--total-samples 1800
--replicas 50 \
Contributor

@remeh remeh Dec 16, 2024

Do we want to merge with these configuration changes?

enhancements:
- |
Metric payloads are compressed using `zstd` compression by default.
This can be reverted to the previous compression kind by adding
Contributor

Suggested change
This can be reverted to the previous compression kind by adding
This can be reverted to the previous compression algorithm by adding

Contributor

Just a suggestion.

Contributor

Oh, I see why you went with "kind".

@StephenWakely StephenWakely added the qa/done QA done before merge and regressions are covered by tests label Dec 17, 2024
blt and others added 2 commits December 17, 2024 18:46
Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Dec 18, 2024

Uncompressed package size comparison

Comparison with ancestor 6d0ce2d61353b6fca280b60e157c06642515fc9c

Diff per package
package diff status size ancestor threshold
datadog-heroku-agent-amd64-deb 1.40MB ⚠️ 506.45MB 505.05MB 70.00MB
datadog-dogstatsd-amd64-deb 0.00MB 78.59MB 78.59MB 10.00MB
datadog-dogstatsd-x86_64-rpm 0.00MB 78.67MB 78.67MB 10.00MB
datadog-dogstatsd-x86_64-suse 0.00MB 78.67MB 78.67MB 10.00MB
datadog-dogstatsd-arm64-deb 0.00MB 55.79MB 55.79MB 10.00MB
datadog-iot-agent-amd64-deb 0.00MB 113.31MB 113.31MB 10.00MB
datadog-iot-agent-x86_64-rpm 0.00MB 113.38MB 113.38MB 10.00MB
datadog-iot-agent-x86_64-suse 0.00MB 113.38MB 113.38MB 10.00MB
datadog-iot-agent-arm64-deb 0.00MB 108.78MB 108.78MB 10.00MB
datadog-iot-agent-aarch64-rpm 0.00MB 108.84MB 108.84MB 10.00MB
datadog-agent-x86_64-rpm -0.01MB 1196.99MB 1197.00MB 140.00MB
datadog-agent-x86_64-suse -0.01MB 1196.99MB 1197.00MB 140.00MB
datadog-agent-amd64-deb -0.01MB 1187.75MB 1187.76MB 140.00MB
datadog-agent-aarch64-rpm -0.02MB 942.98MB 943.00MB 140.00MB
datadog-agent-arm64-deb -0.02MB 933.77MB 933.79MB 140.00MB

Decision

⚠️ Warning

Signed-off-by: Stephen Wakely <fungus.humungus@gmail.com>
Labels
long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/agent-processing-and-routing team/agent-shared-components

7 participants