Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket hang up after 2nd query #150

Closed
GGo3 opened this issue Apr 25, 2023 · 56 comments
Closed

Socket hang up after 2nd query #150

GGo3 opened this issue Apr 25, 2023 · 56 comments
Assignees
Labels
bug Something isn't working

Comments

@GGo3
Copy link

GGo3 commented Apr 25, 2023

Get this error after 2nd query to clickhouse

Error: socket hang up
    at connResetException (node:internal/errors:691:14)
    at Socket.socketOnEnd (node:_http_client:471:23)
    at Socket.emit (node:events:402:35)
    at endReadableNT (node:internal/streams/readable:1343:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  code: 'ECONNRESET'
}

Steps to reproduce

  1. Make 1 query with results
  2. Make some actions with stat less than 7-10 sec
  3. After that make one more query and get this error

Also, try to increase connect_timeout to 20 sec but it didn't help.

@GGo3 GGo3 added the bug Something isn't working label Apr 25, 2023
@slvrtrn
Copy link
Contributor

slvrtrn commented Apr 25, 2023

What is the environment where the code is executed and the Node.js version?
Could you provide a minimal reproduction scenario?

@GGo3
Copy link
Author

GGo3 commented Apr 25, 2023

Sure
Script run on server Ubuntu 18.04.4 LTS
Node v16.13.2
npm 8.1.2
@clickhouse/client 0.0.14

Clickhouse server 21.7.4.18

in createClient use only host username password database from .env
The problem is that when we have one open CH pool, we make select queries to the database one after another. When the interval between these queries is not long, up to 1-2 seconds, everything works smoothly. However, as soon as a longer interval is made between the queries (more then 5 sek), we receive this error. It feels like a feature was added where if the connection is not used for some time, the socket cannot be used.

@olexiyb
Copy link
Contributor

olexiyb commented Apr 25, 2023

We also try to understand the cause of ECONNRESET we see randomly
This article gives some clue

@slvrtrn
Copy link
Contributor

slvrtrn commented Apr 26, 2023

@GGo3, do you have a load balancer in front of ClickHouse? It could be one of the reasons for timeouts. Additionally, we just released 0.0.15 (props to @olexiyb); you could try checking it out, maybe the fix will work for you as well.

@GGo3
Copy link
Author

GGo3 commented Apr 26, 2023

Nope. Just remote clickhouse server, every 10 min by crontab start node script, createClient to CH and work with data.

@GGo3
Copy link
Author

GGo3 commented Apr 26, 2023

Updated new version 0.0.15. It doesn't help

@slvrtrn
Copy link
Contributor

slvrtrn commented Apr 26, 2023

@GGo3, have you tried increasing request_timeout?

Additionally, CH 21.7 is quite old. Have you tried it with more recent CH versions?

@slvrtrn
Copy link
Contributor

slvrtrn commented Apr 26, 2023

@GGo3 could you provide a code snippet to help us to reproduce the problem?

I tried to reproduce it locally, but it worked fine.

My env: Fedora 37, Node.js v16.17.0, ClickHouse 23.3, @clickhouse/client 0.0.15

import {createClient} from '@clickhouse/client'

void (async () => {
    const client = createClient({
        connect_timeout: 5_000,
        request_timeout: 60_000,
    })
    let rows
    for (let i = 1; i <= 10; i++) {
        rows = await longRunningQuery(client)
        console.info(`#${i}`, await rows.json())
        if (i !== 10) {
            await sleep(30_000)
        }
    }
    await client.close()
})()

function longRunningQuery(client) {
    return client.query({
        query: `SELECT sleep(3), rand64();`,
        format: 'JSONEachRow',
    })
}

function sleep(ms) {
    return new Promise((resolve) => {
        setTimeout(resolve, ms);
    });
}

I suppose it is similar to the use case described in the OP.

It prints:

#1 [ { 'sleep(3)': 0, 'rand64()': '3559013576299350133' } ]
#2 [ { 'sleep(3)': 0, 'rand64()': '5520979173458390446' } ]
#3 [ { 'sleep(3)': 0, 'rand64()': '10411217679392783453' } ]
#4 [ { 'sleep(3)': 0, 'rand64()': '1204984556245591429' } ]
#5 [ { 'sleep(3)': 0, 'rand64()': '2927514430233340513' } ]
#6 [ { 'sleep(3)': 0, 'rand64()': '4869895186792984655' } ]
#7 [ { 'sleep(3)': 0, 'rand64()': '16062549944760496101' } ]
#8 [ { 'sleep(3)': 0, 'rand64()': '16442138677355003640' } ]
#9 [ { 'sleep(3)': 0, 'rand64()': '158837237593955734' } ]
#10 [ { 'sleep(3)': 0, 'rand64()': '7616247968259855513' } ]

and there are no exceptions.

Additionally, I tested it with our Cloud, similar code, only added host and password. Again, no exceptions.

@slvrtrn slvrtrn added question Further information is requested and removed bug Something isn't working labels Apr 26, 2023
@GGo3
Copy link
Author

GGo3 commented Apr 26, 2023

Hmm. Test your script and it work fine. I will deal with it and inform you later.

@olexiyb
Copy link
Contributor

olexiyb commented Apr 27, 2023

Hi, my fix for timeout case was related to nodejs 19, I was able to fix it just because it was reproducible.
But we still see completely random socket hang up issue.
In our case we used Debian 11 + nodejs 18 + clickhouse 23.1 in production.
now we are migrating to Alpine 3.17 + nodejs 19 + clickhouse 23.3 and still see 1-2 hang issues per day
So far I was not able to identify the cause, but I see numerous reported issues in nodejs
nodejs/node#39810
nodejs/node#47228

Maybe we should use undici for clickhouse client...

@nlapshin
Copy link

nlapshin commented May 7, 2023

I see the same problem in my project. Ubuntu 22.04, nodejs 18.16 + clickhouse 22. I do not think that It depends of clickhouse or os version, more probably it depends of nodeJS version.

I changed versions node.js use 16 and 18 versions and caught socket hang up very often. My case: I call long http request to API(need to get big batch of data for one request) - it takes 1-2 minutes and after that try to delete(query below) old data and insert new batch of data.

And I see the same problem then run nodeJS 20 version rarely.

ALTER TABLE {table: Identifier} DELETE where date between '2023-05-05' and '2023-05-05' "

@nlapshin
Copy link

nlapshin commented May 7, 2023

And It does not depend that query I do. For example, I send
SELECT sleep(3), rand64();
and get the same error(socket hang up)

For HTTP request I use node-fetch package, version 3.3.0

@Jank1310
Copy link

@olexiyb Did you find a fix to the hang ups / did you try undici?

@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 22, 2023

We reworked how request/socket timeouts work internally in 0.1.0. Can someone please check if the issue is still reproducible after the upgrade?

@GGo3
Copy link
Author

GGo3 commented Jun 22, 2023

The problem is still presented. Update version from 0.0.14 to 0.1.0. Also, try to set
connect_timeout: 5_000,
request_timeout: 60_000

Maybe need to test on the latest CH version, because we test it on 21.7.4.18

@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 22, 2023

@GGo3, could you provide a minimal repro for the issue?

@GGo3
Copy link
Author

GGo3 commented Jun 22, 2023

Script run on server Ubuntu 18.04.4 LTS
Node v16.13.2
npm 8.1.2
@clickhouse/client 0.1.0

Clickhouse server 21.7.4.18

run script start with create an instance new Clickhouse(config).

CH:

const clickhouseClient = require("@clickhouse/client");

class Clickhouse {
    constructor (config) {
        if (!config.host || !config.user || !config.password || !config.database) {
            throw "Credentials not found";
        }

        const protocol = "http";
        const url = `${protocol}://${config.host}:${config.port || 8123}`;

        this.client = clickhouseClient.createClient({
            host: url,
            username: config.user,
            password: config.password,
            database: config.database,
            connect_timeout: 5_000,
            request_timeout: 60_000
        });
    }

    async query (sql) {
        const dataSet = await this.client.query({
            query: sql,
            format: "JSONEachRow"
        });

        const data = [];
        for await (const rows of dataSet.stream()) {
            for (let i = 0; i < rows.length; i++) {
                data.push(rows[i].json());
            }
        }

        return data;
    }

    async close () {
        await this.client.close();
    }
}

after that
Make 1 query with correct results
Make some actions with stat less than 7-10 sec
After that make 1 more query and get this error

@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 22, 2023

@GGo3, a few questions regarding the example:

  • when close is called?
  • how the CH server is deployed? Any LB in front of it?

@GGo3
Copy link
Author

GGo3 commented Jun 22, 2023

  1. Close only after the script is finished. I mean we wait for all queries result, it can be 10-30 query one by one.
  2. CH server without LB on (Ubuntu 20.04.3 LTS)

@nlapshin
Copy link

@slvrtrn Today I updated to the latest version and It seems to work well. I'll try to get this into production next week and send my feedback later. Thanks.

@nlapshin
Copy link

@slvrtrn Today our team tested it and it seems the same problems are existed

@slvrtrn slvrtrn self-assigned this Jun 26, 2023
@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 26, 2023

@GGo3, @nlapshin,
I managed to reproduce the issue locally.

Could you try tuning the server keep alive setting?

For example, in config.xml:

<keep_alive_timeout>120</keep_alive_timeout>

The number here (in seconds) should be more than the expected idle of the application.

In the meantime, we are figuring out the proper solution on the client side.

EDIT: it is actually 3 seconds by default

@slvrtrn slvrtrn added bug Something isn't working and removed question Further information is requested labels Jun 27, 2023
@movy
Copy link

movy commented Jul 16, 2023

It looks like with keep_alive off and client reconnection upon a error thrown, everything is pretty stable so far. I guess the docs need to mention keep_alive nuances and examples need to include proper error handling strategies. I wish I could help with that, but currently i'm en-route with very little spare time till August. If this still requires attention, I may PR then.

Thanks heaps for your help!

@slvrtrn
Copy link
Contributor

slvrtrn commented Jul 16, 2023

@movy, how do you reconnect the client? By closing it (effectively destroying the underlying HTTP agent) and creating a new instance?

@amitava82
Copy link

Alright, I was going nuts with the same issue. Thankfully I found this thread. Is it recommended to keep the keep_alive off? I'd think it'll optimal to keep the connection open to reduce latency. In my APIs I'm getting lot of socket hang up error when user make frequent requests. Won't it add to latency with the TCP connection overhead?

Another question, I'm not calling await client.close() after ch query completed. Should I be doing it?

@slvrtrn
Copy link
Contributor

slvrtrn commented Jul 17, 2023

@amitava82, you could try the workarounds described in the docs.
Disabling KeepAlive will increase latency as the sockets with open connections won't be reused.
You should not call await client.close() after each query. Ideally, it should be called only once during the graceful shutdown of the app.

@amitava82
Copy link

Thanks! I'm trying out the suggestions. Not seeing any errors but I'll report if any.

@slvrtrn slvrtrn pinned this issue Jul 18, 2023
@movy
Copy link

movy commented Jul 18, 2023

I just learned that my reconnection routine does not work, as after most recent ECONNRESET the ingestion stopped, i.e. client never reconnected.

My code:

 class ClickHouse {
    #createClient() {
        return createClient({
            host: this.#local ? process.env.CLICKHOUSE_HOST : process.env.CLOUDHOUSE_HOST,
            username: 'default',
            password: this.#local ? process.env.CLICKHOUSE_SECRET : process.env.CLOUDHOUSE_SECRET,
            keep_alive: {
                enabled: false,
            },
        })


    async closeStreams() {
        for (const stream of Object.values(this.#streams)) {
            stream.push(null)
        }
        await Promise.all(Object.values(this.#insertPromises))
        await this.#client.close()
    }

    this.#insertPromises[table] = this.#client
                .insert({
                    table,
                    values: this.#streams[table],
                    format: 'JSONEachRow',
                })
                .then(() => console.info(`${table} data ingestion is finished`))
                .catch((error) => {
                    this.closeStreams().then(() => this.#client = this.#createClient())
                })

}

@slvrtrn, what's the correct way to recover from this error?

Also note, this error happens only with ClickHouse Cloud set up. With my local CH instance so far hasn't had any socket hang ups.

@slvrtrn
Copy link
Contributor

slvrtrn commented Jul 18, 2023

@movy, you just need to recreate the failed ingestion promise for a particular table in case of an error. In your snippet, I see that you recreate the client (without closing the previous instance), but I don't think you need to do it at all.

@kobi-co
Copy link

kobi-co commented Sep 3, 2023

@slvrtrn I have slightly different issue (with a similar error):
I have a clickhouse cluster we are connecting to using vpn.
When I call ping() (When vpn is turned off) on this client instance, I'm getting a timeout error (which is OK, it's not thrown out), but in addition there in a uncaughtException which destroys my process:

Error: socket hang up
    at connResetException (node:internal/errors:705:14)
    at TLSSocket.socketCloseListener (node:_http_client:467:25)
    at TLSSocket.emit (node:events:525:35)
    at TLSSocket.emit (node:domain:489:12)
    at node:net:301:12
    at TCP.done (node:_tls_wrap:588:7)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ECONNRESET'
}

As opposed to this, When I use query in order to ping (tried a few different approaches), I don't get the same error (only timeout error without socket hang up that kills my process).
Any idea here ?

Thanks!

@slvrtrn
Copy link
Contributor

slvrtrn commented Sep 4, 2023

@kobi-co, this is interesting cause we explicitly destroy the response stream, and the entire call is wrapped in a try-catch in the internal implementation (see this).

Please confirm that you are running at least 0.2.1 and getting these errors there. I will try to debug it then.

@kobi-co
Copy link

kobi-co commented Sep 4, 2023

I was running it with 0.2.0 (was there any chance to 0.2.1 ?), We have debugged it and the destroy there is not doing the job.
When I added a destroy after this line just to test it, it solved the issue.
If you are having trouble reproducing it, I can try to help..

@slvrtrn
Copy link
Contributor

slvrtrn commented Sep 4, 2023

@kobi-co, yes, that's what I was thinking as well - the request stream is not destroyed properly. Thanks for the report!

@igolka97
Copy link

Getting same error so far with 2.3.0 ver. Also using streams as the participants above. Tried with or without keepalive option.

Env: node 18 alpine, CH 23.5.3.24. Everything running on docker separate containers.

@slvrtrn
Copy link
Contributor

slvrtrn commented Oct 16, 2023

@igolka97, can you please provide more details about your use case?

  • sample code
  • client config
  • how frequently is the data coming to/from the streams?
  • is there any load balancer between the application and ClickHouse?

@slvrtrn
Copy link
Contributor

slvrtrn commented Oct 18, 2023

I am thinking about releasing this recent branch main...idle-sockets-timeouts as 0.3.0-beta1, as I tried to implement a proper idle socket removal there (using some undocumented Node.js API and timeout shenanigans, but oh well).

From my tests, with an endless loop with a random number of concurrent requests and random waits between them, it is stable, and from the logs, I could tell that the sockets were expiring while idling as expected.

@GGo3 @amitava82 @gabibora @nlapshin @olexiyb
apologies for the direct ping, but would you be interested in trying it?

@igolka97
Copy link

@slvrtrn

  • there are no any balancer between CH and application
  • frequency is about 100-200 pushes to stream per second
  • client config is simple as possible
return createClient({ host, username, password, database });
  • sample code:
export class StreamRequestLoggerService<T extends PossibleLogRequest>
  extends RequestLoggerInterface<T>
  implements OnApplicationShutdown
{
  constructor(
    private readonly tableName: string,
    private readonly clickhouseClient: ClickHouseClient,
  ) {
    super();
    this.insertStream = this.clickhouseClient.insert({
      table: this.tableName,
      format: 'JSONEachRow',
      values: this.stream,
    });
  }

  stream = new Readable({
    objectMode: true,
    read: () => {
      /**/
    },
  });

  insertStream: Promise<ConnBaseResult>;

  onApplicationShutdown() {
    this.stream.push(null);
    return this.insertStream;
  }

  async logRequest(request: Array<T>) {
    request.forEach((r) => {
      this.stream.push(r);
    });
  }
}

@mshustov
Copy link
Member

@igolka97 how do you use logRequest? Is the stream open all the time while the app running?

@igolka97
Copy link

igolka97 commented Oct 18, 2023

@mshustov
yes, the stream is open all the time.
I'm using logRequest as insert abstraction over original clickhouse-js insert method in my app. Calling logRequest each time as simple as that is when new http request is coming.

UPD BTW
stream opens just on application startup and closes on application shutdown

@slvrtrn
Copy link
Contributor

slvrtrn commented Oct 18, 2023

@igolka97, the code you provided looks similar to what we have in our examples for endless streams, so I see no obvious issues.

I adjusted it slightly for my tests to specifically check the disabled keep alive setting, as you mentioned it:

import { ClickHouseClient, createClient } from "@clickhouse/client";
import { randomInt } from "crypto";
import { Readable } from "stream";

export class StreamRequestLoggerService {
  stream = new Readable({
    objectMode: true,
    read: () => {
      /**/
    },
  });

  insertStream: Promise<unknown>;

  constructor(
    private readonly tableName: string,
    private readonly clickhouseClient: ClickHouseClient,
  ) {
    this.insertStream = this.clickhouseClient
      .insert({
        table: this.tableName,
        format: "JSONEachRow",
        values: this.stream,
      })
      .then(() => console.info("\nData ingestion is finished"));
  }

  onApplicationShutdown() {
    this.stream.push(null);
    return this.insertStream;
  }

  logRequest<T>(request: Array<T>) {
    request.forEach((r) => {
      this.stream.push(r);
    });
  }
}

async function main() {
  const client = createClient({
    keep_alive: {
      enabled: false,
    },
  });
  const tableName = "test_logs";
  await client.command({
    query: `
      CREATE OR REPLACE TABLE ${tableName}
      (id UInt64, name String)
      ENGINE MergeTree()
      ORDER BY (id)
    `,
  });
  const log = new StreamRequestLoggerService("test_logs", client);

  for (let i = 0; i < 10_000; i++) {
    console.info(`[${i + 1}] Pushing several records into the stream...`);
    const data = [...Array(randomInt(100, 10_000))].map(() => ({
      id: randomInt(1, 100_000_000),
      name: Math.random().toString(36).slice(2),
    }));
    log.logRequest(data);
    await new Promise((resolve) => setTimeout(resolve, randomInt(1, 1000)));
  }

  // When Ctrl+C is pressed...
  async function cleanup() {
    await log.onApplicationShutdown();
    await client.close();
    process.exit(0);
  }

  process.on("SIGINT", cleanup);
  process.on("SIGTERM", cleanup);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

No issues with thousands of iterations here on my local setup...

The only way to trigger ECONNRESET/socket hang up was to kill the ClickHouse instance.

How often do you experience socket hang up in your case? Are you sure that the ClickHouse instance is not restarted at that time?

@movy
Copy link

movy commented Nov 1, 2023

I think I managed to create a reproducible test case, see below.

In my new project I use a universal Clickhouse class that is used for streaming inserts or for simple inserts / updates, depending on who's using the class. This lead me to discovery that if I create a insertPromise but never use it, normal inserts will fail within a minute after opening the connection (basically the opening case of this issue).

I realized such client 'mixed-use' might be an anti-pattern in the realm of this library, so I tried creating separate clients for each function call (via createClient(..)), but this did not help.

Eventually, I split my code into separate branches, so insertPromise is never created unless we actually intend to stream inserts on the instance.

After reading node-fetch/node-fetch#1735 and vlucas/frisby#594 I tried different versions of Node (17-21 I believe), same result everywhere.

Please note that the error is not catchable, i.e. inevitably crashes the whole app each time it's encountered.
Clickhouse 23.9.2.56, "@clickhouse/client": "^0.2.4", Ubuntu 22.04

I tried myriad of keep_alive settings client and server-side, the results are basically the same.

The code includes 4 cases, two of them will fail:

import { createClient } from '@clickhouse/client'
import * as utils from './utils.js'
import Stream from 'node:stream'

export class ClickHouse {
    #insertPromise
    #client

    constructor() {
        this.#createStream()
        this.#client = createClient({
            host: process.env.CLICKHOUSE_HOST,
            username: 'default',
            password: process.env.CLICKHOUSE_SECRET,

            keep_alive: {
                enabled: false,
                // socket_ttl: 2500,
                // retry_on_expired_socket: true,
            }
        })
    }
    #createStream() {
        this.stream = new Stream.Readable({
            objectMode: true,
            read() {},
        })
    }
    async init() {
        await this.#client.command({
            query: `
                CREATE TABLE IF NOT EXISTS test_table(
                    id UInt32,
                    name String,
                    time DateTime64
                ) ENGINE = Memory
                `,
        }).catch(error => console.error('⚠️ clickhouse init error:', error))
    }

    async createPromise(table) {
        this.#insertPromise = this.#client
            .insert({
                table,
                values: this.stream,
                format: 'JSONEachRow',
            })
            .catch(async (error) => {
                console.error(error)
                process.exit(255)
            })
    }

    async streamTestData() {
        console.log('streaming in test data')
        for (let index = 0; index < 1000; index++) {
            this.stream.push({
                id: index,
                name: 'test',
                time: Date.now(),
            })
        }
    }

    // insert data using INSERT (not stream) with 5 sec sleep in between
    async insertTestData() {
        for (let index = 0; index < 10; index++) {
            console.log('inserting test data', index)
            await this.#client.insert({
                table: 'test_table',
                values: [
                    {
                        id: index,
                        time: Date.now(),
                        name: 'test',
                    }
                ],
                format: 'JSONEachRow',
            }).catch(console.error)
            await utils.sleep(5000)
        }
    }

    async closeStreams() {
        // count rows in test_table
        const { data } = await (await this.#client.query({ query: 'SELECT count(*) FROM test_table' })).json()
        console.log('count rows in test_table', data)
        // close stream
        console.log('clickhouse cleanup')
        this.stream.push(null)
        // stream.destroy()
        // when the stream is closed, the insert stream can be awaited
        await this.#insertPromise

        await this.#client.close()
        console.log('clickhouse cleanup done')
    }
}

// this passes
const test1 = async () => {
    // await clickhouse.createPromise('test_table')
    // await clickhouse.streamTestData()
    await clickhouse.insertTestData()
}

// this passes
const test2 = async () => {
    await clickhouse.createPromise('test_table')
    await clickhouse.streamTestData()
    // await clickhouse.insertTestData()
}

// this fails with SOCKET_TIMEOUT
const test3 = async () => {
    await clickhouse.createPromise('test_table')
    await clickhouse.streamTestData()
    await clickhouse.insertTestData()
}

// this fails with Error: socket hang up ECONNRESET
const test4 = async () => {
    await clickhouse.createPromise('test_table')
    // await clickhouse.streamTestData()
    await clickhouse.insertTestData()
}

const clickhouse = new ClickHouse()
await clickhouse.init()

// await test1()
// await test2()
// await test3()
await test4()

await clickhouse.closeStreams()

@slvrtrn
Copy link
Contributor

slvrtrn commented Nov 1, 2023

@movy, I checked your example, but in my case (Node.js 18x., Fedora), all four pass. I see one socket hang up error at the very end when the application quits, though.

First take with the output (exact same code, just slightly adjusted to TS): https://gist.github.com/slvrtrn/392c6dd5371d84651a3c2d13ed32946a

Looks like socket hang up happens because we have some dangling promises produced here:

async createPromise(table) {
    this.#insertPromise = this.#client
        .insert({
            table,
            values: this.stream,
            format: 'JSONEachRow',
        })
        .catch(async (error) => {
            console.error(error)
            process.exit(255)
        })
}

The stream is reused for all of these promises (which causes issues cause we have some conflicting listeners now, i.e., UB).

Here is a fixed snippet (see some adjustments in createPromise): https://gist.github.com/slvrtrn/f920f042dea8e6ebfdb9728b124a5b71

NB: it also flushes the data correctly, as it should be 2030 (and not 30) records there.

Enabling all KeepAlive settings with the fixed example like this:

keep_alive: {
  enabled: true,
  socket_ttl: 2500,
  retry_on_expired_socket: true,
}

Causes no issues.

EDIT: I also checked your example with my unreleased idle sockets timeout branch, and, most likely, the "socket hang up" in the 4th example happens cause we create an empty stream without any initial data coming in, and server just severs the connection by itself exactly after 3 seconds of waiting (this is the default setting of keep_alive_timeout in config.xml). Pushing just one record fixes it, i.e., it's better not to create the insert stream in advance, only when there is at least something to push for the first time, keeping the stream afterward.

See: https://gist.github.com/slvrtrn/7fb24918661e9b5066a131f32b194ca1

If you are interested in trying out the reworked idle sockets and giving your feedback on how it works in your scenario, please send me a DM in the community Slack (the same name as in the GitHub profile). I will share the build (as I am not in favor of releasing breaking changes if it does not fix the issue or introduce other ones).

@movy
Copy link

movy commented Nov 2, 2023

@slvrtrn, I apologise for the confusion, but tests1-4 are to be run separately (hence I commented out all but one in my code).

Thanks for your explanation regarding test4. Maybe it's worth to be documented, as may be obvious in hindsight, it might not be so for new library users. Basically it means we'd have to instantly push at least one record upon a promise creation, as in many practical use cases a promise is created on an app's launch, but data can come at any point later in time, not necessarily immediately after start. 3 seconds timeout is definitely insufficient here.

One big problem with ECONNRESET is that it's uncatchable and often crashes the whole app without proper stack-trace, i.e. if we use many connections / streams and one of them hangs up like this, it is very tricky to find the piece of code that caused it.

As for test3, if run separately, the promise is utilised and the stream is destroyed correctly, yet still SOCKET_TIMEOUT error. I have never seen this in production though, only while creating this test-case.

@breslavsky
Copy link

I can confirm, we have the same issue:

return createClient({
    keep_alive: {
        enabled: true,
        socket_ttl: 60000,
        retry_on_expired_socket: true
    }
});

In Sentry 174 errors per 100k inserts to ClickHouse.

Screenshot 2024-01-14 at 10 10 50

@slvrtrn
Copy link
Contributor

slvrtrn commented Jan 14, 2024

@breslavsky, why is your socket_ttl set to 60_000? what is your ClickHouse server keep_alive_timeout setting?

@breslavsky
Copy link

@breslavsky, why is your socket_ttl set to 60_000? what is your ClickHouse server keep_alive_timeout setting?

We set 120 seconds.
Idea that client should close socket by itself every 60 seconds and it don't afraid that server unexpectedly closed the socket on their side.

@slvrtrn
Copy link
Contributor

slvrtrn commented Jan 14, 2024

@breslavsky, could you DM me in the community Slack?

@slvrtrn slvrtrn unpinned this issue Mar 16, 2024
@slvrtrn
Copy link
Contributor

slvrtrn commented Mar 18, 2024

Closing this, as it should be fixed as of 0.3.0. If there are still issues with sockets after upgrading, please open a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests