perf: Add workflow template informer to server #13672

Open
wants to merge 14 commits into base: main

@jakkubu jakkubu commented Sep 27, 2024

Motivation

Improve performance of creating workflows with complex templateRef structure.

During template validation, the k8s API is called for each templateRef. For complex workflows with many refs, this creates a huge overhead, so this PR caches those templates.

Connected to issue #7418.

This is a follow-up to PR #13633.

Modifications

Added an informer to the server and used it in workflow validation.
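
A minimal sketch of the idea, using only the standard library: template lookups are served from an in-memory store (standing in for the informer's cache) and fall back to a direct API call on a miss. `Template`, `TemplateGetter`, `apiGetter`, and `cachedGetter` are hypothetical names for illustration, not the types added in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Template is a stand-in for a WorkflowTemplate (hypothetical, for illustration).
type Template struct{ Name string }

// TemplateGetter resolves a templateRef name to a template.
type TemplateGetter interface {
	Get(name string) (*Template, error)
}

// apiGetter simulates a getter that calls the k8s API on every lookup.
type apiGetter struct {
	mu    sync.Mutex
	calls int
	data  map[string]*Template
}

func (g *apiGetter) Get(name string) (*Template, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.calls++ // each lookup costs one simulated API round trip
	if t, ok := g.data[name]; ok {
		return t, nil
	}
	return nil, errors.New("template not found: " + name)
}

// cachedGetter serves lookups from a local store and falls back to the API on a miss.
type cachedGetter struct {
	mu       sync.RWMutex
	store    map[string]*Template
	fallback TemplateGetter
}

func (g *cachedGetter) Get(name string) (*Template, error) {
	g.mu.RLock()
	t, ok := g.store[name]
	g.mu.RUnlock()
	if ok {
		return t, nil
	}
	return g.fallback.Get(name)
}

// validateRefs resolves every ref, as validation would for each templateRef.
func validateRefs(g TemplateGetter, refs []string) error {
	for _, r := range refs {
		if _, err := g.Get(r); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	api := &apiGetter{data: map[string]*Template{
		"echo-1": {Name: "echo-1"},
		"echo-2": {Name: "echo-2"},
	}}
	refs := []string{"echo-1", "echo-2", "echo-1", "echo-2"}

	_ = validateRefs(api, refs)
	fmt.Println("API calls without cache:", api.calls)

	api.calls = 0
	cached := &cachedGetter{
		store:    map[string]*Template{"echo-1": {Name: "echo-1"}},
		fallback: api,
	}
	_ = validateRefs(cached, refs)
	fmt.Println("API calls with cache:", api.calls)
}
```

In the sketch only `echo-1` is in the store, so lookups of `echo-2` still reach the fallback; a real informer would keep the store in sync with the cluster.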

Verification

I ran tests similar to the ones in the first PR. The results are awesome; benchmarking results and details are in a separate comment.

@jakkubu jakkubu changed the title Add workflow template informer to server perf: Add workflow template informer to server Sep 27, 2024
@jakkubu jakkubu force-pushed the add-server-informer branch 3 times, most recently from 9dac1ce to 3d90e33 Compare October 9, 2024 08:07
@jakkubu
Author

jakkubu commented Oct 10, 2024

Benchmarking multiple-ref template creation

Setup

Branches:

  1. Main (commit 5244064)
  2. Rebased to above commit and changes from PR: perf: Add template validation caching #13633 (commit 9df7abf)

Using a fresh kind cluster, v1.28.9.

The Argo server was started with `server --auth-mode=server --auth-mode=client --kube-api-burst=200 --kube-api-qps=200`.

Benchmark workflow templates are placed in test/benchmarks/*.yaml.

Before each test, the following procedure was followed:

  1. Delete all workflows
  2. Wait for all workflow pods to be removed
  3. Restart the controller and server

Benchmarking tool: hey. By default it sends 200 requests using 50 concurrent workers. These values can be changed with:

  • -n: number of requests
  • -c: number of workers

A typical call is described in test/benchmarks/README.md.
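
To make the request/worker model concrete, here is a rough Go sketch of what such a load run does: `n` requests are spread over `c` concurrent workers and the average response time is computed. It is not hey's implementation; `runLoad` and `newStubServer` are hypothetical helpers, the target is a local stub rather than the Argo Server, and unlike the `-disable-keepalive` runs in this benchmark the default Go HTTP client may reuse connections.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// newStubServer starts a local test server that answers 200 to every request.
// It stands in for the Argo Server endpoint (an assumption for this sketch).
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
}

// runLoad sends n POST requests using c concurrent workers and returns the
// average response time of completed requests and the number of 200 responses.
func runLoad(url, body string, n, c int) (time.Duration, int) {
	jobs := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var (
		mu        sync.Mutex
		total     time.Duration
		completed int
		ok        int
		wg        sync.WaitGroup
	)
	for w := 0; w < c; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				start := time.Now()
				resp, err := http.Post(url, "application/json", strings.NewReader(body))
				elapsed := time.Since(start)
				if err != nil {
					continue // failed requests are not included in the average
				}
				resp.Body.Close()
				mu.Lock()
				total += elapsed
				completed++
				if resp.StatusCode == http.StatusOK {
					ok++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	if completed == 0 {
		return 0, ok
	}
	return total / time.Duration(completed), ok
}

func main() {
	srv := newStubServer()
	defer srv.Close()
	// 200 requests over 50 workers, matching the defaults described above.
	avg, ok := runLoad(srv.URL, `{"workflow":{}}`, 200, 50)
	fmt.Printf("average response time: %v, 200 responses: %d\n", avg, ok)
}
```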

Results

| Requests | Workers | Template | No cache ART [s] | Manual cache ART [s] | Informer ART [s] |
|---|---|---|---|---|---|
| 200 | 50 | 20-echos | deadline exceeded | 9.1370 | 0.0833 |
| 50 | 2 | 20-echos | 4.3682 | 0.3974 | 0.0119 |
| 16 | 8 | 20-echos | 18.0247 | 1.0127 | 0.0290 |
| 50 | 1 | 20-echos | 2.3204 | 0.2095 | 0.0273 |
| 200 | 50 | echo-1 | 11.7437 | 4.3005 | 0.0362 |

*ART - Average Response Time
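
As a quick way to read the table, the ratio between the "No cache" and "Informer" columns gives the speedup per scenario. The sketch below only does that arithmetic on the values reported above; the `row` type and `speedup` function are illustrative names, and the row whose baseline timed out is left out because it has no numeric baseline.

```go
package main

import "fmt"

// row holds one benchmark result from the table (average response times in seconds).
type row struct {
	requests, workers int
	template          string
	noCache, informer float64
}

// speedup returns how many times faster the informer run was than the baseline.
func speedup(baseline, informer float64) float64 {
	return baseline / informer
}

func main() {
	// Rows with a measured "No cache" value; the 200/50 20-echos run
	// ended with "deadline exceeded" and is omitted.
	rows := []row{
		{50, 2, "20-echos", 4.3682, 0.0119},
		{16, 8, "20-echos", 18.0247, 0.0290},
		{50, 1, "20-echos", 2.3204, 0.0273},
		{200, 50, "echo-1", 11.7437, 0.0362},
	}
	for _, r := range rows {
		fmt.Printf("%d requests / %d workers / %s: %.0fx faster\n",
			r.requests, r.workers, r.template, speedup(r.noCache, r.informer))
	}
}
```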

Appendix

Manual Caching hey output

hey \
    -n 200 -c 50 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	38.5292 secs
  Slowest:	10.0132 secs
  Fastest:	0.2812 secs
  Average:	9.1370 secs
  Requests/sec:	5.1909


Response time histogram:
  0.281 [1]	|
  1.254 [0]	|
  2.228 [2]	|■
  3.201 [3]	|■
  4.174 [3]	|■
  5.147 [1]	|
  6.120 [0]	|
  7.094 [6]	|■■
  8.067 [8]	|■■
  9.040 [26]	|■■■■■■■
  10.013 [150]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 7.8264 secs
  25% in 9.0510 secs
  50% in 9.9969 secs
  75% in 9.9999 secs
  90% in 10.0010 secs
  95% in 10.0017 secs
  99% in 10.0090 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0086 secs, 0.2812 secs, 10.0132 secs
  DNS-lookup:	0.0008 secs, 0.0003 secs, 0.0028 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0009 secs
  resp wait:	9.1282 secs, 0.2608 secs, 10.0097 secs
  resp read:	0.0001 secs, 0.0001 secs, 0.0006 secs

Status code distribution:
  [200]	200 responses
hey \
    -n 50 -c 1 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	10.4736 secs
  Slowest:	2.8200 secs
  Fastest:	0.0116 secs
  Average:	0.2095 secs
  Requests/sec:	4.7739


Response time histogram:
  0.012 [1]	|■
  0.292 [43]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.573 [2]	|■■
  0.854 [2]	|■■
  1.135 [1]	|■
  1.416 [0]	|
  1.697 [0]	|
  1.977 [0]	|
  2.258 [0]	|
  2.539 [0]	|
  2.820 [1]	|■


Latency distribution:
  10% in 0.0215 secs
  25% in 0.0398 secs
  50% in 0.0904 secs
  75% in 0.2085 secs
  90% in 0.4544 secs
  95% in 0.9240 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0150 secs, 0.0116 secs, 2.8200 secs
  DNS-lookup:	0.0004 secs, 0.0003 secs, 0.0022 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.1941 secs, 0.0084 secs, 2.7740 secs
  resp read:	0.0003 secs, 0.0001 secs, 0.0037 secs

Status code distribution:
  [200]	50 responses
hey \
    -n 16 -c 8 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	2.0559 secs
  Slowest:	1.9955 secs
  Fastest:	0.0570 secs
  Average:	1.0127 secs
  Requests/sec:	7.7826


Response time histogram:
  0.057 [1]	|■■■■■
  0.251 [7]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.445 [0]	|
  0.639 [0]	|
  0.832 [0]	|
  1.026 [0]	|
  1.220 [0]	|
  1.414 [0]	|
  1.608 [0]	|
  1.802 [0]	|
  1.996 [8]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.0593 secs
  25% in 0.0601 secs
  50% in 1.9162 secs
  75% in 1.9707 secs
  90% in 1.9955 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0140 secs, 0.0570 secs, 1.9955 secs
  DNS-lookup:	0.0017 secs, 0.0001 secs, 0.0032 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.9971 secs, 0.0330 secs, 1.9879 secs
  resp read:	0.0004 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200]	16 responses
 hey \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "echo-1"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	18.5190 secs
  Slowest:	5.0074 secs
  Fastest:	0.0267 secs
  Average:	4.3005 secs
  Requests/sec:	10.7997

  Total data:	200600 bytes
  Size/request:	1003 bytes

Response time histogram:
  0.027 [1]	|
  0.525 [2]	|■
  1.023 [0]	|
  1.521 [8]	|■■
  2.019 [9]	|■■
  2.517 [7]	|■■
  3.015 [9]	|■■
  3.513 [10]	|■■■
  4.011 [3]	|■
  4.509 [2]	|■
  5.007 [149]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 2.0682 secs
  25% in 4.3698 secs
  50% in 4.9992 secs
  75% in 5.0003 secs
  90% in 5.0012 secs
  95% in 5.0018 secs
  99% in 5.0058 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0076 secs, 0.0267 secs, 5.0074 secs
  DNS-lookup:	0.0013 secs, 0.0002 secs, 0.0070 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0007 secs
  resp wait:	4.2928 secs, 0.0135 secs, 5.0043 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
  [200]	200 responses
hey \
    -n 50 -c 2 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	9.9623 secs
  Slowest:	2.1692 secs
  Fastest:	0.0177 secs
  Average:	0.3974 secs
  Requests/sec:	5.0189


Response time histogram:
  0.018 [1]	|■■
  0.233 [21]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.448 [19]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.663 [0]	|
  0.878 [4]	|■■■■■■■■
  1.093 [1]	|■■
  1.309 [0]	|
  1.524 [1]	|■■
  1.739 [0]	|
  1.954 [2]	|■■■■
  2.169 [1]	|■■


Latency distribution:
  10% in 0.0415 secs
  25% in 0.0680 secs
  50% in 0.3625 secs
  75% in 0.4159 secs
  90% in 1.0482 secs
  95% in 1.8283 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0151 secs, 0.0177 secs, 2.1692 secs
  DNS-lookup:	0.0005 secs, 0.0001 secs, 0.0015 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.3813 secs, 0.0131 secs, 2.1538 secs
  resp read:	0.0009 secs, 0.0001 secs, 0.0175 secs

Status code distribution:
  [200]	50 responses
Caching OFF hey output

hey \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Error distribution:
  [1]	Post "https://localhost:2746/api/v1/workflows/argo-test": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
hey \
    -n 50 -c 1 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	116.0186 secs
  Slowest:	3.0743 secs
  Fastest:	0.8805 secs
  Average:	2.3204 secs
  Requests/sec:	0.4310


Response time histogram:
  0.881 [1]	|■
  1.100 [0]	|
  1.319 [1]	|■
  1.539 [0]	|
  1.758 [0]	|
  1.977 [0]	|
  2.197 [0]	|
  2.416 [46]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.636 [0]	|
  2.855 [1]	|■
  3.074 [1]	|■


Latency distribution:
  10% in 2.3432 secs
  25% in 2.3488 secs
  50% in 2.3497 secs
  75% in 2.3521 secs
  90% in 2.3539 secs
  95% in 2.8334 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0046 secs, 0.8805 secs, 3.0743 secs
  DNS-lookup:	0.0005 secs, 0.0003 secs, 0.0014 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0001 secs
  resp wait:	2.3155 secs, 0.8690 secs, 3.0706 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0007 secs

Status code distribution:
  [200]	50 responses
hey \
    -n 10 -c 2 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	22.0162 secs
  Slowest:	5.2764 secs
  Fastest:	3.1682 secs
  Average:	4.3682 secs
  Requests/sec:	0.4542


Response time histogram:
  3.168 [1]	|■■■■■■■■■■■■■
  3.379 [1]	|■■■■■■■■■■■■■
  3.590 [0]	|
  3.801 [0]	|
  4.011 [0]	|
  4.222 [2]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■
  4.433 [1]	|■■■■■■■■■■■■■
  4.644 [1]	|■■■■■■■■■■■■■
  4.855 [1]	|■■■■■■■■■■■■■
  5.066 [0]	|
  5.276 [3]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 3.2170 secs
  25% in 4.1037 secs
  50% in 4.4375 secs
  75% in 5.2165 secs
  90% in 5.2764 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0101 secs, 3.1682 secs, 5.2764 secs
  DNS-lookup:	0.0010 secs, 0.0004 secs, 0.0025 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0001 secs
  resp wait:	4.3578 secs, 3.1597 secs, 5.2671 secs
  resp read:	0.0002 secs, 0.0001 secs, 0.0006 secs

Status code distribution:
  [200]	10 responses
 hey \
    -n 16 -c 8 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	36.8523 secs
  Slowest:	19.6692 secs
  Fastest:	16.5928 secs
  Average:	18.0247 secs
  Requests/sec:	0.4342


Response time histogram:
  16.593 [1]	|■■■■■■■■■■
  16.900 [0]	|
  17.208 [3]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  17.516 [4]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  17.823 [0]	|
  18.131 [0]	|
  18.439 [0]	|
  18.746 [4]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  19.054 [2]	|■■■■■■■■■■■■■■■■■■■■
  19.362 [0]	|
  19.669 [2]	|■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 17.1318 secs
  25% in 17.2294 secs
  50% in 18.4704 secs
  75% in 18.7662 secs
  90% in 19.6692 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0123 secs, 16.5928 secs, 19.6692 secs
  DNS-lookup:	0.0013 secs, 0.0003 secs, 0.0021 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0002 secs
  resp wait:	18.0119 secs, 16.5720 secs, 19.6658 secs
  resp read:	0.0002 secs, 0.0001 secs, 0.0008 secs

Status code distribution:
  [200]	16 responses
hey \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test"
            },
            "spec": {
                "workflowTemplateRef": {"name": "echo-1"},
                "arguments": {}
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	48.5225 secs
  Slowest:	12.7513 secs
  Fastest:	6.5247 secs
  Average:	11.7437 secs
  Requests/sec:	4.1218

  Total data:	200600 bytes
  Size/request:	1003 bytes

Response time histogram:
  6.525 [1]	|
  7.147 [5]	|■■
  7.770 [2]	|■
  8.393 [1]	|
  9.015 [0]	|
  9.638 [7]	|■■
  10.261 [9]	|■■■
  10.883 [13]	|■■■■
  11.506 [12]	|■■■■
  12.129 [33]	|■■■■■■■■■■■
  12.751 [117]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 9.9773 secs
  25% in 12.0355 secs
  50% in 12.4015 secs
  75% in 12.4993 secs
  90% in 12.5020 secs
  95% in 12.5169 secs
  99% in 12.6785 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0083 secs, 6.5247 secs, 12.7513 secs
  DNS-lookup:	0.0008 secs, 0.0002 secs, 0.0030 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0013 secs
  resp wait:	11.7352 secs, 6.5076 secs, 12.7478 secs
  resp read:	0.0001 secs, 0.0001 secs, 0.0020 secs

Status code distribution:
  [200]	200 responses
Informer Hey outputs

hey \
    -n 50 -c 1 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test",
                "labels": {
                    "workflows.argoproj.io/benchmark": "true"
                }
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {},
                "podMetadata": {
                    "labels": {
                        "workflows.argoproj.io/benchmark": "true"
                    }
                }
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	1.3652 secs
  Slowest:	0.0959 secs
  Fastest:	0.0093 secs
  Average:	0.0273 secs
  Requests/sec:	36.6243


Response time histogram:
  0.009 [1]	|■■
  0.018 [11]	|■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.027 [17]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.035 [8]	|■■■■■■■■■■■■■■■■■■■
  0.044 [10]	|■■■■■■■■■■■■■■■■■■■■■■■■
  0.053 [2]	|■■■■■
  0.061 [0]	|
  0.070 [0]	|
  0.079 [0]	|
  0.087 [0]	|
  0.096 [1]	|■■


Latency distribution:
  10% in 0.0111 secs
  25% in 0.0185 secs
  50% in 0.0249 secs
  75% in 0.0377 secs
  90% in 0.0408 secs
  95% in 0.0524 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0096 secs, 0.0093 secs, 0.0959 secs
  DNS-lookup:	0.0005 secs, 0.0003 secs, 0.0013 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0001 secs
  resp wait:	0.0174 secs, 0.0060 secs, 0.0581 secs
  resp read:	0.0002 secs, 0.0001 secs, 0.0036 secs

Status code distribution:
  [200]	50 responses
hey \
    -n 16 -c 8 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test",
                "labels": {
                    "workflows.argoproj.io/benchmark": "true"
                }
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {},
                "podMetadata": {
                    "labels": {
                        "workflows.argoproj.io/benchmark": "true"
                    }
                }
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	0.0608 secs
  Slowest:	0.0422 secs
  Fastest:	0.0159 secs
  Average:	0.0290 secs
  Requests/sec:	263.1996


Response time histogram:
  0.016 [1]	|■■■■■■■■
  0.019 [3]	|■■■■■■■■■■■■■■■■■■■■■■■■
  0.021 [3]	|■■■■■■■■■■■■■■■■■■■■■■■■
  0.024 [1]	|■■■■■■■■
  0.026 [0]	|
  0.029 [0]	|
  0.032 [0]	|
  0.034 [0]	|
  0.037 [0]	|
  0.040 [5]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.042 [3]	|■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.0163 secs
  25% in 0.0193 secs
  50% in 0.0389 secs
  75% in 0.0393 secs
  90% in 0.0422 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0091 secs, 0.0159 secs, 0.0422 secs
  DNS-lookup:	0.0012 secs, 0.0002 secs, 0.0020 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0005 secs
  resp wait:	0.0190 secs, 0.0098 secs, 0.0280 secs
  resp read:	0.0002 secs, 0.0000 secs, 0.0010 secs

Status code distribution:
  [200]	16 responses
hey \
    -n 50 -c 2 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test",
                "labels": {
                    "workflows.argoproj.io/benchmark": "true"
                }
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {},
                "podMetadata": {
                    "labels": {
                        "workflows.argoproj.io/benchmark": "true"
                    }
                }
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	0.3063 secs
  Slowest:	0.0279 secs
  Fastest:	0.0083 secs
  Average:	0.0119 secs
  Requests/sec:	163.2477


Response time histogram:
  0.008 [1]	|■■
  0.010 [21]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.012 [17]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.014 [3]	|■■■■■■
  0.016 [2]	|■■■■
  0.018 [1]	|■■
  0.020 [3]	|■■■■■■
  0.022 [0]	|
  0.024 [0]	|
  0.026 [0]	|
  0.028 [2]	|■■■■


Latency distribution:
  10% in 0.0089 secs
  25% in 0.0096 secs
  50% in 0.0104 secs
  75% in 0.0122 secs
  90% in 0.0185 secs
  95% in 0.0279 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0043 secs, 0.0083 secs, 0.0279 secs
  DNS-lookup:	0.0004 secs, 0.0000 secs, 0.0021 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0001 secs
  resp wait:	0.0075 secs, 0.0054 secs, 0.0174 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
  [200]	50 responses
hey \
    -n 200 -c 50 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test",
                "labels": {
                    "workflows.argoproj.io/benchmark": "true"
                }
            },
            "spec": {
                "workflowTemplateRef": {"name": "20-echos"},
                "arguments": {},
                "podMetadata": {
                    "labels": {
                        "workflows.argoproj.io/benchmark": "true"
                    }
                }
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	0.3854 secs
  Slowest:	0.1707 secs
  Fastest:	0.0137 secs
  Average:	0.0833 secs
  Requests/sec:	518.9769


Response time histogram:
  0.014 [1]	|■
  0.029 [1]	|■
  0.045 [13]	|■■■■■■■■■■
  0.061 [21]	|■■■■■■■■■■■■■■■■
  0.076 [49]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.092 [54]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.108 [27]	|■■■■■■■■■■■■■■■■■■■■
  0.124 [14]	|■■■■■■■■■■
  0.139 [13]	|■■■■■■■■■■
  0.155 [6]	|■■■■
  0.171 [1]	|■


Latency distribution:
  10% in 0.0480 secs
  25% in 0.0686 secs
  50% in 0.0810 secs
  75% in 0.0956 secs
  90% in 0.1353 secs
  95% in 0.1381 secs
  99% in 0.1471 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0235 secs, 0.0137 secs, 0.1707 secs
  DNS-lookup:	0.0009 secs, 0.0000 secs, 0.0037 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0010 secs
  resp wait:	0.0592 secs, 0.0056 secs, 0.1481 secs
  resp read:	0.0005 secs, 0.0000 secs, 0.0089 secs

Status code distribution:
  [200]	200 responses
hey \
    -n 200 -c 50 \
    -m POST \
    -disable-keepalive \
    -T "application/json" \
    -d '{
        "serverDryRun": false,
        "workflow": {
            "metadata": {
                "generateName": "curl-echo-test-",
                "namespace": "argo-test",
                "labels": {
                    "workflows.argoproj.io/benchmark": "true"
                }
            },
            "spec": {
                "workflowTemplateRef": {"name": "echo-1"},
                "arguments": {},
                "podMetadata": {
                    "labels": {
                        "workflows.argoproj.io/benchmark": "true"
                    }
                }
            }
        }
        }' \
    https://localhost:2746/api/v1/workflows/argo-test

Summary:
  Total:	0.1556 secs
  Slowest:	0.0568 secs
  Fastest:	0.0171 secs
  Average:	0.0362 secs
  Requests/sec:	1285.0704

  Total data:	230400 bytes
  Size/request:	1152 bytes

Response time histogram:
  0.017 [1]	|■
  0.021 [2]	|■■
  0.025 [7]	|■■■■■■
  0.029 [29]	|■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.033 [44]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.037 [32]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.041 [33]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.045 [22]	|■■■■■■■■■■■■■■■■■■■■
  0.049 [13]	|■■■■■■■■■■■■
  0.053 [9]	|■■■■■■■■
  0.057 [8]	|■■■■■■■


Latency distribution:
  10% in 0.0277 secs
  25% in 0.0299 secs
  50% in 0.0356 secs
  75% in 0.0418 secs
  90% in 0.0480 secs
  95% in 0.0524 secs
  99% in 0.0566 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0184 secs, 0.0171 secs, 0.0568 secs
  DNS-lookup:	0.0008 secs, 0.0000 secs, 0.0024 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0011 secs
  resp wait:	0.0175 secs, 0.0063 secs, 0.0279 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0006 secs

Status code distribution:
  [200]	200 responses

@jakkubu jakkubu force-pushed the add-server-informer branch 5 times, most recently from 653023b to f1f89a9 Compare October 11, 2024 10:07
@jakkubu jakkubu marked this pull request as ready for review October 11, 2024 10:50
}

func (a *argoKubeClient) startStores(restConfig *restclient.Config, namespace string) error {
if a.opts.UseCaching {
Member

UseCaching appears to be always false

Author

@jakkubu jakkubu Oct 17, 2024

This was intentional, to avoid introducing a breaking change. At the same time, my team uses argoKubeClient in code and we would like to enable caching there. The code that depends on this is tested; it's basically server code.

Member

I'm not sure why you consider the caching version a breaking change? What does it break?

This PR is marked as a performance improvement, but doesn't improve the performance of the product, only of your usage of it as a go-client? Why wouldn't everyone want this enabled? It uses more memory...

Author

The problem I'm facing is that there is little testing in pkg/apiclient.
I could expose this option in the CLI to run e2e tests and make it more testable. However, I don't think this option makes sense in the CLI. An informer would simply make startup time longer; only in very specific conditions could it make a difference. Even then, you could simply connect to a server that has caching enabled by default instead of using a k8s connection.

Author

> This PR is marked as a performance improvement, but doesn't improve the performance of the product, only of your usage of it as a go-client? Why wouldn't everyone want this enabled? It uses more memory...

This is enabled by default for the Argo server, and all tests for the Argo server use this improvement. This part disables it for user-facing argo CLI commands when you are using a kubectl connection. So if you use the argo CLI to submit a workflow, you won't need to wait for the informer to synchronise all templates.

Author

@Joibel can you re-review this PR? I answered your comments; if that is not enough, please let me know.

@@ -37,14 +37,34 @@ var (
NoArgoServerErr = fmt.Errorf("this is impossible if you are not using the Argo Server, see %s", help.CLI())
)

type ArgoKubeOpts struct {
Member

This struct is never used as initialised in this code, nor are there any tests for UseCaching = true.

I believe this might be "for the future" but please could it not be included in this PR and saved for a future one until it's tested and used.

Author

@jakkubu jakkubu Oct 17, 2024

It's for programmatic users of argo-kube-client. There are no tests for such usage. It doesn't make sense to enable it by default, since that would also enable it in the CLI, where we don't want an informer.
The second use case for this structure is the SDK, and there the right setting depends on the use case: for a one-time submit it doesn't make sense, for a long-running process it does. I'm not sure which segment we are targeting, so I'm assuming it's better to keep the default as is.
The server doesn't use this code for its startup and has caching enabled, as there are obvious benefits there.
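
The opt-in behaviour argued for here can be sketched as an options struct whose zero value keeps the old, uncached path. Only the `UseCaching` field name comes from the diff; `Opts` and `storeKind` are hypothetical and the returned strings merely describe the trade-off discussed in this thread.

```go
package main

import "fmt"

// Opts mirrors the idea of ArgoKubeOpts discussed above (hypothetical type for illustration).
type Opts struct {
	// UseCaching enables an informer-backed template store.
	// The zero value (false) keeps the previous, uncached behaviour.
	UseCaching bool
}

// storeKind reports which template store a client built with opts would use.
func storeKind(opts Opts) string {
	if opts.UseCaching {
		return "informer-backed store (waits for cache sync at startup)"
	}
	return "direct k8s API store (no startup cost)"
}

func main() {
	// A one-off CLI submit keeps the default; a long-running process opts in.
	fmt.Println("one-off CLI submit:", storeKind(Opts{}))
	fmt.Println("long-running SDK process:", storeKind(Opts{UseCaching: true}))
}
```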

@tooptoop4
Contributor

does #13763 affect this?

@jakkubu
Author

jakkubu commented Oct 31, 2024

> does #13763 affect this?

I don't think so. This PR is adding informer to server, whilst the root issue (as identified by @Joibel in his comment) is in controller logic.

@jakkubu
Author

jakkubu commented Nov 4, 2024

New benchmarking implementation results.

How to run:

make BenchmarkArgoServer

Results:

  • original (): 890 ms (890185243 ns/op)
    The benchmark time was increased to 20s to get more executions; with the default time there were only 2.
  • after adding the informer: 9 ms (9067016 ns/op)
    The default benchmark time was used, because with 20s over 3k workflows were created, which might skew the result.

This is in line with the manual benchmarks, which show a ~100x improvement with the informer.
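
For readers unfamiliar with how the ns/op figures are produced, the following is a rough, self-contained sketch of a Go benchmark that times one submit request per iteration. It is not the `BenchmarkArgoServer` test from this PR: `newStubServer` and `submitOnce` are hypothetical helpers, and the target is a local stub server, so the numbers it prints say nothing about Argo itself.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

// newStubServer starts a local server that accepts every request.
// It stands in for the Argo Server (an assumption for this sketch).
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
}

// submitOnce posts a placeholder workflow body and reports whether the server answered 200.
func submitOnce(url string) bool {
	resp, err := http.Post(url, "application/json", strings.NewReader(`{"workflow":{}}`))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	// testing.Benchmark chooses b.N so that the loop runs for the benchmark time,
	// then reports the iteration count and the time per operation.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if !submitOnce(srv.URL) {
				b.Fatal("submit failed")
			}
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

A slow operation yields very few iterations within the default benchmark time, which matches the note above about only 2 executions before the informer was added.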

 make BenchmarkArgoServer
GIT_COMMIT=6dfb464b60ed7149132fc188af6a6168b7ea46f1 GIT_BRANCH=add-server-informer GIT_TAG=untagged GIT_TREE_STATE=dirty RELEASE_TAG=false DEV_BRANCH=true VERSION=latest
KUBECTX=k3d-k3s-default DOCKER_DESKTOP=false K3D=true DOCKER_PUSH=false TARGET_PLATFORM=linux/arm64
RUN_MODE=local PROFILE=minimal AUTH_MODE=hybrid SECURE=false  STATIC_FILES=false ALWAYS_OFFLOAD_NODE_STATUS=false UPPERIO_DB_DEBUG=0 LOG_LEVEL=debug NAMESPACED=true
go test --tags api,cli,cron,executor,examples,corefunctional,functional,plugins ./test/e2e -run='BenchmarkArgoServer' -benchmem -bench 'BenchmarkArgoServer'  .
WARN[0000] Non-transient error: <nil>                   
WARN[0000] Non-transient error: <nil>                   
Creating workflow template multiple-ref-echo-1
Creating workflow template multiple-ref-echo-2
Creating workflow template multiple-ref-main
goos: linux
goarch: arm64
pkg: github.com/argoproj/argo-workflows/v3/test/e2e
BenchmarkArgoServer/Submit_workflow_with_multiple_refs-12                    118           9067016 ns/op          137881 B/op        243 allocs/op
--- BENCH: BenchmarkArgoServer/Submit_workflow_with_multiple_refs-12
    printer.go:116: POST /api/v1/workflows/argo HTTP/1.1
        Host: localhost:2746
        Authorization: Bearer [REDACTED]
        
        {
                                                "workflow": {
                                                        "metadata": {
                                                                "generateName": "create-wf-from-template-benchmark-",
                                                                "labels": {
                                                                        "workflows.argoproj.io/benchmark": "true",
        ... [output truncated]
PASS
ok      github.com/argoproj/argo-workflows/v3/test/e2e  6.547s

During template validation k8s API is called for each templateRef.
For complex workflows with many refs it creates huge overhead.
Let's use informer for getting templates and use old mechanism as fallback

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
…late server

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Remove Lister() method (as informer don't support full k8s list options)

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
fix not starting clusterWftmpl Informer in server
add more descriptive client store naming

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Pass created client stores in tests

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
Enable single benchmark run

Signed-off-by: Jakub Buczak <jbuczak@splunk.com>
5 participants