
cluster: use linkedlist for round_robin_handle #40615

Closed
wants to merge 1 commit into from
Conversation

@twchn (Contributor) commented Oct 26, 2021

Currently, an array is used as a queue to manage handles. When there are many handles, ArrayPrototypeShift may become a bottleneck, so using the built-in linked list reduces the time complexity of the handoff method to constant time and may be helpful for #37343.
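For context, here is a minimal sketch (not the actual patch) contrasting the two approaches: an array-backed queue pays O(n) per ArrayPrototypeShift call, while a FIFO backed by a singly linked list with head and tail pointers enqueues and dequeues in O(1). The names below (LinkedQueue, pending) are illustrative, not from the PR.

```javascript
'use strict';

// Minimal FIFO queue backed by a singly linked list. Both enqueue and
// dequeue are O(1); Array#shift, by contrast, has to move the remaining
// elements down, which is typically O(n) in the queue length.
class LinkedQueue {
  constructor() {
    this.head = null;
    this.tail = null;
    this.size = 0;
  }

  enqueue(value) {
    const node = { value, next: null };
    if (this.tail === null) this.head = node;
    else this.tail.next = node;
    this.tail = node;
    this.size++;
  }

  dequeue() {
    if (this.head === null) return undefined;
    const { value } = this.head;
    this.head = this.head.next;
    if (this.head === null) this.tail = null;
    this.size--;
    return value;
  }
}

// Round-robin handoff sketch: queue pending connection handles, then
// hand each one off in arrival order.
const pending = new LinkedQueue();
pending.enqueue('handle-1');
pending.enqueue('handle-2');
console.log(pending.dequeue()); // 'handle-1'
console.log(pending.dequeue()); // 'handle-2'
```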

@nodejs-github-bot added the cluster (Issues and PRs related to the cluster subsystem.), needs-ci (PRs that need a full CI run.), and timers (Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout.) labels Oct 26, 2021
@Trott (Member) commented Oct 26, 2021

I thought we had a benchmark for round robin somewhere but I'm not seeing it. @nodejs/benchmarking

@yunnysunny (Contributor) commented Nov 19, 2021

> I thought we had a benchmark for round robin somewhere but I'm not seeing it. @nodejs/benchmarking

You can just use the official cluster example code:

const cluster = require('cluster');
const http = require('http');
const numCPUs = 2; // require('os').cpus().length;
const process = require('process');

if (cluster.isMaster) {
  console.log(`Primary ${process.pid} is running`);

  // Fork workers.
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
  });
} else {
  // Workers can share any TCP connection
  // In this case it is an HTTP server
  http.createServer((req, res) => {
    res.writeHead(200);
    res.end('hello world\n');
  }).listen(8000);

  console.log(`Worker ${process.pid} started`);
}

After running it, the log prints as follows:

Primary 5308 is running
Worker 5321 started
Worker 5322 started

You can then use a benchmark tool, such as jmeter, to send requests to the service you have just started. Here is an example jmeter configuration:

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2" properties="5.0" jmeter="5.3">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Test Plan" enabled="true">
      <stringProp name="TestPlan.comments"></stringProp>
      <boolProp name="TestPlan.functional_mode">false</boolProp>
      <boolProp name="TestPlan.tearDown_on_shutdown">true</boolProp>
      <boolProp name="TestPlan.serialize_threadgroups">false</boolProp>
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments" guiclass="ArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
        <collectionProp name="Arguments.arguments"/>
      </elementProp>
      <stringProp name="TestPlan.user_define_classpath"></stringProp>
    </TestPlan>
    <hashTree>
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Thread Group" enabled="true">
        <stringProp name="ThreadGroup.on_sample_error">continue</stringProp>
        <elementProp name="ThreadGroup.main_controller" elementType="LoopController" guiclass="LoopControlPanel" testclass="LoopController" testname="Loop Controller" enabled="true">
          <boolProp name="LoopController.continue_forever">false</boolProp>
          <intProp name="LoopController.loops">-1</intProp>
        </elementProp>
        <stringProp name="ThreadGroup.num_threads">8</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">false</boolProp>
        <stringProp name="ThreadGroup.duration"></stringProp>
        <stringProp name="ThreadGroup.delay"></stringProp>
        <boolProp name="ThreadGroup.same_user_on_next_iteration">true</boolProp>
      </ThreadGroup>
      <hashTree>
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
            <collectionProp name="Arguments.arguments"/>
          </elementProp>
          <stringProp name="HTTPSampler.domain">please_replace_this_to_the_ip_of_your_service</stringProp>
          <stringProp name="HTTPSampler.port">8000</stringProp>
          <stringProp name="HTTPSampler.protocol"></stringProp>
          <stringProp name="HTTPSampler.contentEncoding"></stringProp>
          <stringProp name="HTTPSampler.path">/</stringProp>
          <stringProp name="HTTPSampler.method">GET</stringProp>
          <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
          <boolProp name="HTTPSampler.auto_redirects">false</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">false</boolProp><!-- disable keepalive to create connection as fast as it can-->
          <boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
          <stringProp name="HTTPSampler.embedded_url_re"></stringProp>
          <stringProp name="HTTPSampler.connect_timeout"></stringProp>
          <stringProp name="HTTPSampler.response_timeout"></stringProp>
        </HTTPSamplerProxy>
        <hashTree>
          <ResultCollector guiclass="ViewResultsFullVisualizer" testclass="ResultCollector" testname="View Results Tree" enabled="false">
            <boolProp name="ResultCollector.error_logging">false</boolProp>
            <objProp>
              <name>saveConfig</name>
              <value class="SampleSaveConfiguration">
                <time>true</time>
                <latency>true</latency>
                <timestamp>true</timestamp>
                <success>true</success>
                <label>true</label>
                <code>true</code>
                <message>true</message>
                <threadName>true</threadName>
                <dataType>true</dataType>
                <encoding>false</encoding>
                <assertions>true</assertions>
                <subresults>true</subresults>
                <responseData>false</responseData>
                <samplerData>false</samplerData>
                <xml>false</xml>
                <fieldNames>true</fieldNames>
                <responseHeaders>false</responseHeaders>
                <requestHeaders>false</requestHeaders>
                <responseDataOnError>false</responseDataOnError>
                <saveAssertionResultsFailureMessage>true</saveAssertionResultsFailureMessage>
                <assertionsResultsToSave>0</assertionsResultsToSave>
                <bytes>true</bytes>
                <sentBytes>true</sentBytes>
                <url>true</url>
                <threadCounts>true</threadCounts>
                <idleTime>true</idleTime>
                <connectTime>true</connectTime>
              </value>
            </objProp>
            <stringProp name="filename"></stringProp>
          </ResultCollector>
          <hashTree/>
          <ResultCollector guiclass="SummaryReport" testclass="ResultCollector" testname="Summary Report" enabled="true">
            <boolProp name="ResultCollector.error_logging">false</boolProp>
            <objProp>
              <name>saveConfig</name>
              <value class="SampleSaveConfiguration">
                <time>true</time>
                <latency>true</latency>
                <timestamp>true</timestamp>
                <success>true</success>
                <label>true</label>
                <code>true</code>
                <message>true</message>
                <threadName>true</threadName>
                <dataType>true</dataType>
                <encoding>false</encoding>
                <assertions>true</assertions>
                <subresults>true</subresults>
                <responseData>false</responseData>
                <samplerData>false</samplerData>
                <xml>false</xml>
                <fieldNames>true</fieldNames>
                <responseHeaders>false</responseHeaders>
                <requestHeaders>false</requestHeaders>
                <responseDataOnError>false</responseDataOnError>
                <saveAssertionResultsFailureMessage>true</saveAssertionResultsFailureMessage>
                <assertionsResultsToSave>0</assertionsResultsToSave>
                <bytes>true</bytes>
                <sentBytes>true</sentBytes>
                <url>true</url>
                <threadCounts>true</threadCounts>
                <idleTime>true</idleTime>
                <connectTime>true</connectTime>
              </value>
            </objProp>
            <stringProp name="filename"></stringProp>
          </ResultCollector>
          <hashTree/>
        </hashTree>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

We name this configuration file cluster.jmx.
Notice that we have disabled the keep-alive header so that jmeter creates new sockets to the server as quickly as possible.

Finally, run jmeter with the following command:

 bin/jmeter.sh -n -t /dir_of_jmx/cluster.jmx -l /tmp/cluster.jtl -e -o /tmp/cluster.out

You can then see the CPU usage via the top command:

[Screenshot: top output, 2021-11-19]

The primary process (with pid 5308) also uses a lot of CPU time.

@Trott (Member) commented Nov 19, 2021

@nodejs/cluster

@mcollina (Member) left a comment

lgtm with a positive or neutral benchmark CI

@mcollina (Member) commented

Benchmark CI is neutral.

@mcollina (Member) commented

cc @Trott could you take a look on how to land this?

@Trott (Member) commented Nov 28, 2021

> cc @Trott could you take a look on how to land this?

@mcollina I pulled the changes down locally, rebased against master (to pick up the fix for the test that was failing on debug builds in Jenkins), squashed the two commits into one (to fix the GitHub Actions commit-linter complaint), force-pushed to twchn's branch for this PR, and started a rebuild on Jenkins. Hopefully everything is green, and then the commit-queue label can be added.

@Trott added the commit-queue (Add this label to land a pull request using GitHub Actions.) label Nov 29, 2021
@nodejs-github-bot added the commit-queue-failed (An error occurred while landing this pull request using GitHub Actions.) label and removed the commit-queue label Nov 29, 2021
@nodejs-github-bot (Collaborator) commented

Commit Queue failed
- Loading data for nodejs/node/pull/40615
✔  Done loading data for nodejs/node/pull/40615
----------------------------------- PR info ------------------------------------
Title      cluster: use linkedlist for round_robin_handle (#40615)
   ⚠  Could not retrieve the email or name of the PR author's from user's GitHub profile!
Branch     twchn:feat_use_linkedlist_for_round_robin -> nodejs:master
Labels     cluster, timers, needs-ci
Commits    1
 - cluster: use linkedlist for round_robin_handle
Committers 1
 - Rich Trott 
PR-URL: https://github.com/nodejs/node/pull/40615
Reviewed-By: Matteo Collina 
Reviewed-By: James M Snell 
------------------------------ Generated metadata ------------------------------
PR-URL: https://github.com/nodejs/node/pull/40615
Reviewed-By: Matteo Collina 
Reviewed-By: James M Snell 
--------------------------------------------------------------------------------
   ⚠  Commits were pushed since the last review:
   ⚠  - cluster: use linkedlist for round_robin_handle
   ℹ  This PR was created on Tue, 26 Oct 2021 16:12:34 GMT
   ✔  Approvals: 2
   ✔  - Matteo Collina (@mcollina) (TSC): https://github.com/nodejs/node/pull/40615#pullrequestreview-811397947
   ✔  - James M Snell (@jasnell) (TSC): https://github.com/nodejs/node/pull/40615#pullrequestreview-811856870
   ✔  Last GitHub Actions successful
   ℹ  Last Benchmark CI on 2021-11-19T16:05:03Z: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1058/
   ℹ  Last Full PR CI on 2021-11-29T02:31:19Z: https://ci.nodejs.org/job/node-test-pull-request/41180/
- Querying data for job/node-test-pull-request/41180/
   ✔  Last Jenkins CI successful
--------------------------------------------------------------------------------
   ✔  Aborted `git node land` session in /home/runner/work/node/node/.ncu
https://github.com/nodejs/node/actions/runs/1514597734

PR-URL: #40615
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@Trott (Member) commented Nov 29, 2021

Landed in 4b65dec

@Trott (Member) commented Nov 29, 2021

Thanks for the contribution! 🎉

@Trott closed this Nov 29, 2021
@yunnysunny (Contributor) commented Nov 29, 2021

I have written a Kubernetes YAML file to test the performance of cluster. The Node.js version I used is 14.18.1.

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: hello-deployment
  labels:
    app: hello
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      name: hello-app
      labels:
        app: hello
    spec:
      initContainers:
      - image: busybox
        command:
        - sh
        - -c
        - |
          sysctl -w net.core.somaxconn=10240
          sysctl -w net.ipv4.ip_local_port_range="1024 65535"
          sysctl -w net.ipv4.tcp_tw_reuse=1
          sysctl -w fs.file-max=6048576
        name: setsysctl
        securityContext:
          privileged: true
      containers:
        - name: hello-app
          image: registry.cn-hangzhou.aliyuncs.com/whyun/base:hello-latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 2000m
              memory: 2Gi
            limits:
              cpu: 4000m
              memory: 4Gi
          env:
            - name: APP_ID
              value: "17959"
            - name: APP_SECRET
              value: "6994ea9b6a8d1e673d9cc53aab8e45dd8eaec8d2"

---
kind: Service
apiVersion: v1
metadata:
  name: hello-service
spec:
  selector:
    app: hello
  ports:
    - port: 8000 # Default port for image


---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bench-deployment
  labels:
    app: bench
spec: 
  replicas: 16
  selector:
    matchLabels:
      app: bench
  template:
    metadata:
      name: bench-node
      labels:
        app: bench
    spec:
      initContainers:
      - image: busybox
        command:
        - sh
        - -c
        - |
          sysctl -w net.core.somaxconn=10240
          sysctl -w net.ipv4.ip_local_port_range="1024 65535"
          sysctl -w net.ipv4.tcp_tw_reuse=1
          sysctl -w fs.file-max=1048576
        name: setsysctl
        securityContext:
          privileged: true
      containers:
        - name: bench-node
          image: registry.cn-hangzhou.aliyuncs.com/whyun/base:node-bench-0.2.0
          env:
            - name: APP_ID
              value: "the app id from alinode"
            - name: APP_SECRET
              value: "the app secret from alinode"
            - name: REQ_URL
              value: http://hello-service:8000/
            - name: REQ_INTERVAL_MS
              value: "5"
            - name: REQ_TIMEOUT_MS
              value: "20"

I use alinode, which is integrated into the docker image, as a tool to generate CPU flame graphs. You have to change the APP_ID and APP_SECRET environment variables to the values you obtained from alinode.

You can change the REQ_INTERVAL_MS environment variable to a specific number and then restart the Kubernetes deployment via the command kubectl apply -f the_path_of_the_yaml_file_above.

I first set REQ_INTERVAL_MS to 5, took the CPU usage from the top command, and generated the flame graph with alinode.

[Screenshot: CPU usage with REQ_INTERVAL_MS = 5]

[Screenshot: CPU flame graph with REQ_INTERVAL_MS = 5]

Then set it to 4:

[Screenshot: CPU usage with REQ_INTERVAL_MS = 4]

[Screenshot: CPU flame graph with REQ_INTERVAL_MS = 4]

It changed little compared with 5 ms.

Then set it to 3:

[Screenshot: CPU usage with REQ_INTERVAL_MS = 3]

[Screenshot: CPU flame graph with REQ_INTERVAL_MS = 3]

We can see that it uses more CPU time than at 4 ms.

Then set it to 2:

[Screenshot: CPU usage with REQ_INTERVAL_MS = 2, before generating the flame graph]

First I took a screenshot of the CPU usage. It was almost the same as the usage at 3 ms. But when I then generated the CPU flame graph, the CPU usage of the parent process rose to 100%.

[Screenshot: CPU usage with REQ_INTERVAL_MS = 2, during flame graph generation]

I also found that memory usage rose quickly.

[Screenshot: CPU flame graph with REQ_INTERVAL_MS = 2]

Combining this with the CPU flame graph, we can deduce the reason: 100% of the CPU is used by the V8 engine, so the task of sending socket handles to the child processes slows down. Socket handles pile up in the queue (which is an array in JS); as the queue grows, each shift operation on the array takes longer, CPU usage stays high, and the queue grows even larger the next time. When the queue gets too large, memory exceeds the old-generation limit and the process dies with an OOM.
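The feedback loop above can be illustrated with a rough, non-rigorous sketch (sizes and helper names are illustrative, not from the PR): draining an array-backed queue with shift costs roughly O(n^2) in total because each shift moves the remaining elements, while a linked list drains in O(n).

```javascript
'use strict';

// Drain an array-backed queue of n items using Array#shift; each shift
// typically moves the remaining elements down, so the total work grows
// roughly quadratically with n.
function drainArrayQueue(n) {
  const queue = [];
  for (let i = 0; i < n; i++) queue.push(i);
  const start = process.hrtime.bigint();
  while (queue.length > 0) queue.shift();
  return Number(process.hrtime.bigint() - start) / 1e6; // elapsed ms
}

// Drain a singly-linked-list queue of n items; each dequeue is a
// pointer move, so the total work grows linearly with n.
function drainLinkedQueue(n) {
  let head = null;
  let tail = null;
  for (let i = 0; i < n; i++) {
    const node = { value: i, next: null };
    if (tail === null) head = node;
    else tail.next = node;
    tail = node;
  }
  const start = process.hrtime.bigint();
  while (head !== null) head = head.next;
  return Number(process.hrtime.bigint() - start) / 1e6; // elapsed ms
}

// Compare drain times as the queue grows; exact numbers depend on the
// V8 version and machine, so no specific output is claimed here.
for (const n of [10_000, 50_000]) {
  console.log(`n=${n}: shift ${drainArrayQueue(n).toFixed(2)} ms, ` +
              `list ${drainLinkedQueue(n).toFixed(2)} ms`);
}
```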

This pull request can reduce the probability of OOM, but we can also see that communication between the parent and child processes is not cheap. So I would be glad to see SO_REUSEPORT support brought into Node; there is already a pull request (#3198) in libuv for it.

@twchn deleted the feat_use_linkedlist_for_round_robin branch December 1, 2021 08:27
danielleadams pushed a commit that referenced this pull request Dec 13, 2021
PR-URL: #40615
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
danielleadams pushed a commit that referenced this pull request Dec 14, 2021
PR-URL: #40615
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
danielleadams pushed a commit that referenced this pull request Jan 30, 2022
PR-URL: #40615
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
danielleadams pushed a commit that referenced this pull request Feb 1, 2022
PR-URL: #40615
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@danielleadams mentioned this pull request Feb 1, 2022