Unstable connections to containers #740

Closed
alxsad opened this issue Jun 14, 2023 · 11 comments

@alxsad

alxsad commented Jun 14, 2023

Description

After updating Colima to v0.5.5, I have trouble with my services started by docker-compose. For example, connecting to postgres in a container fails. The same happens with an etcd container.

psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

This behaviour has no apparent pattern. At least one attempt in 10 works correctly. After downgrading Colima to 0.5.4, everything is OK.

Version

Colima Version: 0.5.5
Lima Version: 0.16.0
Qemu Version: 8.0.2

Operating System

  • macOS Intel <= 12 (Monterey)
  • macOS Intel >= 13 (Ventura)
  • macOS M1 <= 12 (Monterey)
  • macOS M1 >= 13 (Ventura)
  • Linux

Output of colima status

INFO[0000] colima [profile=finteqhub] is running using macOS Virtualization.Framework
INFO[0000] arch: aarch64
INFO[0000] runtime: docker
INFO[0000] mountType: virtiofs
INFO[0000] socket: unix:///Users/alxsad/.colima/finteqhub/docker.sock

Reproduction Steps

1. colima start --profile finteqhub --cpu 8 --memory 10 --disk 200 --vm-type=vz --vz-rosetta --runtime docker --mount-type=virtiofs
2. docker-compose up -d
3. psql -h localhost -U postgres -c 'select 1'
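
The failure rate in step 3 can be measured without psql. The following is a minimal sketch, not part of the original report: it sends the PostgreSQL `SSLRequest` startup packet and checks whether the server answers with a single byte (`S` or `N`), which distinguishes a working connection from one that is closed before any reply. The function names `probe_once` and `probe` are made up for this illustration.

```python
import socket
import struct

# PostgreSQL SSLRequest packet: Int32 length (8) followed by Int32 code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_once(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if the server answers the SSLRequest with 'S' or 'N'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
    except OSError:
        # Connection refused, reset, or timed out.
        return False
    # An empty reply means the peer closed the connection without answering.
    return reply in (b"S", b"N")


def probe(host: str, port: int, attempts: int = 10) -> int:
    """Run probe_once `attempts` times and return the number of successes."""
    return sum(probe_once(host, port) for _ in range(attempts))
```

Calling `probe("localhost", 5432)` against the compose setup in this report would give a success count out of 10 that can be compared between Colima versions.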

Expected behaviour

Postgres should work stably without any issues.

Additional context

version: '3.9'

services:
  postgres:
    image: postgres:14.8
    restart: on-failure
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: example
      PGDATA: /var/lib/postgresql/data/pgdata
    user: root
    ports:
      - '0.0.0.0:5432:5432'

volumes:
  pgdata:
@rfay
Contributor

rfay commented Jun 14, 2023

You'll probably have to explore the suggestion it made: "This probably means the server terminated abnormally"

There's likely nothing Colima can do to help you sort this out. See if the server exited though... You may want to turn off your restart: on-failure so you can get more info.

AFAIK DDEV's postgresql handling is fine with latest colima.

Of course using --vz-rosetta puts you in the most fragile possible situation. Postgres 14 has arm64 images, so why bother with using rosetta? And of course, vz itself adds complexity.

@alxsad
Author

alxsad commented Jun 14, 2023

> You'll probably have to explore the suggestion it made: "This probably means the server terminated abnormally"
>
> There's likely nothing Colima can do to help you sort this out. See if the server exited though... You may want to turn off your restart: on-failure so you can get more info.
>
> AFAIK DDEV's postgresql handling is fine with latest colima.
>
> Of course using --vz-rosetta puts you in the most fragile possible situation. Postgres 14 has arm64 images, so why bother with using rosetta? And of course, vz itself adds complexity.

But in the previous version (v0.5.4) everything worked correctly. I think something changed in v0.5.5. Additionally, the container does not restart when this error occurs.

@emanuil-tolev

Did you update your OS recently as well, or can you reproduce that in 0.5.4 everything is (still) fine? There have been some Mac OS updates which affected Lima network connectivity, but it might be something quite different.

@alxsad
Author

alxsad commented Jun 21, 2023

> Did you update your OS recently as well, or can you reproduce that in 0.5.4 everything is (still) fine? There have been some Mac OS updates which affected Lima network connectivity, but it might be something quite different.

I have not updated to a new major version of macOS recently. I only updated dependencies using Homebrew. So the only difference between the previous state of my system and the current one is the new version of Colima and other related dependencies.

@alxsad
Author

alxsad commented Jun 21, 2023

Found a similar issue: docker/compose#10673

@mfrister

mfrister commented Jul 3, 2023

I had a similar issue on an M2 MacBook Air - maybe this helps someone: I had a test suite running on macOS, connecting to services running in Docker via colima. The test suite wasn't shutting down connections to the services properly, so more and more connections accumulated with each test, ending up at > 300 connections (according to lsof on macOS, mostly TCP, all of them in 'established' state).

Roughly at that point, always after the exact same number of tests, even if I disabled some of them (so probably tied to the number of open connections rather than to individual tests), networking to the Docker services broke down until the test process finished and the OS closed all connections. Networking broke (even existing connections were closed) with both gvproxy and slirp networking, and with both an arm64 VM and an x86 VM (emulated via QEMU).

After fixing connection shutdown in the test suite, everything runs fine now. So there seems to be some kind of limit at around 300 connections, where some component used by Colima breaks down. I couldn't find anything such as OOM or networking errors in the kernel logs or the Docker logs, and the containers were not restarting.
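
The kind of fix described above (making sure every test closes its connections) can be sketched as follows. This is an illustration only, not the code from that test suite: `ConnectionTracker` and its `open_count` attribute are hypothetical names, and the counter exists purely to make leaks visible in a test.

```python
import socket
from contextlib import contextmanager


class ConnectionTracker:
    """Hands out TCP connections and guarantees they are closed again."""

    def __init__(self) -> None:
        self.open_count = 0  # connections currently held by callers

    @contextmanager
    def connect(self, host: str, port: int, timeout: float = 3.0):
        sock = socket.create_connection((host, port), timeout=timeout)
        self.open_count += 1
        try:
            yield sock
        finally:
            # Runs even if the test body raises, so sockets cannot accumulate.
            sock.close()
            self.open_count -= 1
```

A test would use `with tracker.connect("localhost", 5432) as sock: ...` and could assert `tracker.open_count == 0` in its teardown to catch leaks before they add up to hundreds of connections.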

@rogeriomgatto

I think I'm hitting the same limit @mfrister mentioned... In my case, it's not a connection leak, but an application with a lot of kafka consumers connecting to multiple kafka instances with docker compose.

I can access a redpanda instance with HTTP before my service starts, and after my service quits, but not while consumers are up. A similar setup in another service but with fewer consumers (and therefore fewer connections) works fine.

Perhaps some limit on port forwarding in ssh?

@rogeriomgatto

More details:

2019 Intel MacBook Pro, MacOS Ventura 13.5.2 (22G91)
colima version 0.5.5
limactl version 0.17.2

Problem happened with ~ 260 sockets in ssh process, including listening sockets.
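
A socket count like the one above can be taken from `lsof` output. The sketch below is an assumption about how such a count might be automated, not the method used in this comment: it relies on `lsof` printing the TCP state in parentheses at the end of each row, and `count_tcp_states` is a made-up helper name.

```python
import re
from collections import Counter

# lsof prints the TCP state in parentheses at the end of the NAME column,
# e.g. "... TCP 127.0.0.1:5432 (LISTEN)".
STATE_RE = re.compile(r"\((\w+)\)\s*$")


def count_tcp_states(lsof_output: str) -> Counter:
    """Count TCP sockets per state in the text output of `lsof -i`."""
    states = Counter()
    for line in lsof_output.splitlines():
        if " TCP " not in line:
            continue  # skip the header and non-TCP rows
        match = STATE_RE.search(line)
        states[match.group(1) if match else "UNKNOWN"] += 1
    return states
```

The input could come from something like `subprocess.run(["lsof", "-nP", "-a", "-p", str(pid), "-i"], capture_output=True, text=True).stdout`, where `pid` is the ssh process used for port forwarding. Summing the counter values gives the total, listening sockets included.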

abiosoft added this to the v0.6.0 milestone Nov 12, 2023
@smiklos

smiklos commented Feb 18, 2024

This is still very much an issue. I'm running the latest Colima, v0.6.8, on a MacBook Air M1 using a vz VM (macOS 13). This also happens with a QEMU VM.

For me, above roughly 240 connections to my container, connections start to break down, and existing connections to the container also break at this point (e.g. VisualVM disconnects from the Java process).

I spent ages tuning my test, but it's a fairly simple suite. In my tests I see many "connection reset by peer" and "dial error" messages. I'm using k6, which seems to create a connection for each virtual user. So even at a rate of 1 request per second the issue still happens: if I configure the test to start with, say, 250 virtual users, k6 tries to open that many connections to the container and fails to keep them connected.

I observed no issues when running my Java process without using a container.
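
The threshold can also be approached without k6 by opening connections one by one and holding them all. The following is a rough sketch under stated assumptions: `max_held_connections` is a hypothetical helper, and a plain TCP connect may succeed against the local port forwarder even when the forwarded leg is already failing, so the result is at best an upper estimate. The process's own file descriptor limit (`ulimit -n`) also has to be higher than `limit`, otherwise the loop stops for that reason instead.

```python
import socket


def max_held_connections(host: str, port: int, limit: int, timeout: float = 3.0) -> int:
    """Open up to `limit` connections, keep them all open at the same time,
    and return how many were established before the first failure."""
    held = []
    try:
        for _ in range(limit):
            try:
                held.append(socket.create_connection((host, port), timeout=timeout))
            except OSError:
                break  # first connection that could not be established
        return len(held)
    finally:
        # Always release the sockets so the probe itself does not leak.
        for sock in held:
            sock.close()
```

Running it once against `localhost` and once against the VM address would show whether the two paths behave differently at the same connection count.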

@smiklos

smiklos commented Feb 18, 2024

Fascinating realisation! This happens when using localhost/127.0.0.1 rather than the IP assigned to the Colima VM (which needs enabling). When using the IP of the VM, it works just fine.
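
On the client side, this workaround amounts to making the host configurable rather than hard-coding localhost. A minimal sketch, assuming the VM address has been enabled (as far as I know via `colima start --network-address`) and is passed in through an environment variable. `COLIMA_VM_IP` is a name invented for this example, and Colima does not set it.

```python
import os


def database_host(env=os.environ) -> str:
    """Prefer an explicitly configured VM address over the forwarded localhost port."""
    # COLIMA_VM_IP is a hypothetical variable name chosen for this sketch;
    # Colima does not set it, so it has to be exported by the user.
    return env.get("COLIMA_VM_IP", "").strip() or "localhost"
```

Falling back to `localhost` keeps the same code working on machines where the VM address is not enabled.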

@ManFromSiberia

@smiklos it really works for me! Previously, requests to my service could take up to 5 seconds. While debugging, I found that the request to the database was taking very long, and when I changed the database host to the Colima VM IP everything started to work perfectly.

jesse-c pushed a commit to SeldonIO/MLServer that referenced this issue May 30, 2024
* build: Lock GitHub runners' OS

This was motivated by our macOS jobs failing [2] because
colima is missing. It looks like this is because the
latest versions of the macOS runner no longer have
colima installed by default [1].

colima is now explicitly installed.

[1] actions/runner-images#6216
[2] `/Users/runner/work/_temp/f19ffbff-27a9-4fc7-80b6-97791d2de141.sh: line 9: colima: command not found`

* build: Lock Colima

* build: Move macOS Docker installation to script

* build: Move macOS libomp activation to script

* build: Use latest Colima

The > 0.6.0 releases actually fix the issue we have linked [1][2][3].

[1] abiosoft/colima#577
[2] https://github.com/jesse-c/MLServer/blob/c3acd60995a72141027eff506e4fd330fe824179/hack/install-docker-macos.sh#L18-L20
[3] > Switch to new user-v2 network. Fixes abiosoft/colima#648, abiosoft/colima#603, abiosoft/colima#577, abiosoft/colima#779, abiosoft/colima#137, abiosoft/colima#740.