It eats tremendous amount of memory on high load #134

Closed
softaria-roman opened this issue May 23, 2016 · 19 comments
@softaria-roman

softaria-roman commented May 23, 2016

Hello
I tried to use this library for a real-time MMO game, and I ran into a problem with the garbage collector working constantly.
With 1000 clients, each sending and receiving 100 messages per second, the library generates ~10 GB of garbage in 2 minutes.
In contrast, Google's x/net/websocket generates about 10 times less work for the GC.
Here are the first two lines from the profiler, gathered during 2 minutes of work:

(pprof) top10
12.02GB of 12.47GB total (96.38%)
Dropped 63 nodes (cum <= 0.06GB)
Showing top 10 nodes out of 35 (cum >= 0.10GB)
     flat  flat%   sum%        cum   cum%
   6.73GB 53.94% 53.94%     6.73GB 53.94%  bytes.makeSlice
   2.30GB 18.45% 72.38%     9.02GB 72.38%  io/ioutil.readAll

Here is where the memory usage gets generated:

[two profiler screenshots]

This could easily be fixed if you allowed passing pre-allocated buffers as arguments to ReadMessage instead of allocating one per frame.

I am ready to provide any further information if you are interested.

@garyburd
Contributor

garyburd commented May 23, 2016

ReadMessage is a helper method for getting a reader using NextReader and reading from that reader to a buffer.

Call NextReader directly and pass the preallocated buffer to the Read method.
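
For illustration, here is a minimal sketch of that approach (the readLoop and handleMessage names and the 4096-byte size are assumptions, not part of the library):

func readLoop(conn *websocket.Conn) {
	buf := make([]byte, 4096) // reused for every message
	conn.SetReadLimit(4096)   // ensure any accepted message fits in buf
	for {
		_, r, err := conn.NextReader()
		if err != nil {
			return
		}
		n, err := io.ReadFull(r, buf)
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			err = nil // the message was simply shorter than the buffer
		}
		if err != nil {
			return
		}
		handleMessage(buf[:n]) // hypothetical application callback
	}
}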

@softaria-roman
Author

Thank you. This approach reduces memory usage dramatically. Sorry for not finding it myself.
But there is still room for improvement:

ROUTINE ======================== github.com/gorilla/websocket.(*Conn).NextReader in C:/Users/roman/gocode/src/github.com/gorilla/websocket/conn.go
67.50MB 103MB (flat, cum) 0.81% of Total
. . 807:
. . 808: c.readSeq++
. . 809: c.readLength = 0
. . 810:
. . 811: for c.readErr == nil {
. 35.50MB 812: frameType, err := c.advanceFrame()
. . 813: if err != nil {
. . 814: c.readErr = hideTempErr(err)
. . 815: break
. . 816: }
. . 817: if frameType == TextMessage || frameType == BinaryMessage {
67.50MB 67.50MB 818: return frameType, messageReader{c, c.readSeq}, nil
. . 819: }
. . 820: }
. . 821:
. . 822: // Applications that do handle the error returned from this method spin in
. . 823: // tight loop on connection failure. To help application developers detect
(pprof) list NextWriter
Total: 12.47GB
ROUTINE ======================== github.com/gorilla/websocket.(*Conn).NextWriter in C:/Users/roman/gocode/src/github.com/gorilla/websocket/conn.go
341.51MB 341.51MB (flat, cum) 2.67% of Total
. . 409: if !isControl(messageType) && !isData(messageType) {
. . 410: return nil, errBadWriteOpCode
. . 411: }
. . 412:
. . 413: c.writeFrameType = messageType
341.51MB 341.51MB 414: return messageWriter{c, c.writeSeq}, nil
. . 415:}
. . 416:
. . 417:func (c *Conn) flushFrame(final bool, extra []byte) error {
. . 418: length := c.writePos - maxFrameHeaderSize + len(extra)
. . 419:
(pprof)

It would be great to introduce some way to avoid instantiating a messageReader and messageWriter per request. E.g., let the caller pass some kind of factory that creates the messageReader and messageWriter instances. That way I could use pre-allocated messageReader and messageWriter values and improve things even further.

@softaria-roman
Author

Or just use a sync.Pool to hold the messageWriter and messageReader instances?
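
For reference, the standard sync.Pool pattern would look roughly like this (a generic sketch of the idea, not gorilla's actual internals):

var writerPool = sync.Pool{
	New: func() interface{} { return new(messageWriter) },
}

// per message:
w := writerPool.Get().(*messageWriter)
// ... write the message using w ...
writerPool.Put(w)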

@garyburd
Contributor

CL 8b209f6 reduces the amount of memory allocated by NextReader and NextWriter.

Readers and writers are discarded to prevent reads past the end of a message and writes after a message is closed. A pool of two readers and two writers per connection is probably sufficient to prevent these errors.

@softaria-roman
Author

Much better now.
NextReader is no longer in my top 10 memory users.
For NextWriter it is the following (with the garbage collector turned off, during 2 minutes of work):

ROUTINE ======================== vendor/github.com/gorilla/websocket.(*Conn).NextWriter in C:/Projects/spacewars.io/spacewar.go/src/vendor/github.com/gorilla/websocket/conn.go
164MB 164MB (flat, cum) 8.67% of Total
. . 412: if !isControl(messageType) && !isData(messageType) {
. . 413: return nil, errBadWriteOpCode
. . 414: }
. . 415:
. . 416: c.writeFrameType = messageType
164MB 164MB 417: w := &messageWriter{c}
. . 418: c.messageWriter = w
. . 419: return w, nil
. . 420:}
. . 421:
. . 422:func (c *Conn) flushFrame(final bool, extra []byte) error {

It is already acceptable for my goals. Thank you.
Please consider whether it is reasonable to put the messageWriters in a sync.Pool to improve memory usage even further. (Maybe it is not, as a sync.Pool would probably introduce delay.)

@garyburd
Contributor

A two-element pool per type per connection is better than a sync.Pool because

  • The connection can ensure that the values alternate as they are used.
  • There's no crosstalk between connections; crosstalk can make application errors difficult to debug.

@softaria-roman
Author

I agree. Do you mean a pool of 2 messageWriter{} elements?

@softaria-roman
Author

Oops. As I think about it now, you mean not calling NextReader and NextWriter for each read and write. Instead, create 2 readers and 2 writers and switch between them, so each odd write uses the first writer and each even write uses the second. Is my current understanding correct? If so, I will run the test.

@garyburd
Contributor

The proposal is that NextWriter alternates between two *messageWriter values and NextReader alternates between two *messageReader values.
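
Sketched concretely, the alternation could look like this (the writers field and reset method are hypothetical, invented purely for illustration):

func (c *Conn) NextWriter(messageType int) (io.WriteCloser, error) {
	// ... opcode validation as in the current code ...
	w := &c.writers[c.writeSeq%2] // writers would be a [2]messageWriter field
	c.writeSeq++
	w.reset(c, messageType) // hypothetical method to re-arm the writer
	return w, nil
}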

I prefer not to do this because a stale io.WriteCloser or io.Reader returned by these APIs could become active again. Since the memory use is acceptable for your use case, let's leave the code as is.

@softaria-roman
Author

Ok. Thank you.

@marianogenovese

marianogenovese commented Jun 18, 2017

Hi, I am relatively new to developing in Go, and I would like to know if you could explain a bit more how to make better use of memory with gorilla. I am developing a websocket server and I see high memory consumption when the server has >300K concurrent connections. Maybe with some code example?

Thanks.

cc: @garyburd @softaria

@elithrar
Contributor

elithrar commented Jun 18, 2017 via email

@marianogenovese

marianogenovese commented Jun 19, 2017

Hi @elithrar, sorry for the delay; as I said, I'm new to Go. Here is the info related to your questions, with more context:
a. I'm developing a websocket server, and I'm testing several languages (Node.js, Go, C++, C#). I would like to choose the best language for performance and productivity.
b. My goal is to handle 500K connections and an equal or close number of requests per second.
c. The test I am running now, as you can see in my code (point 3), is a simple echo.
d. The test info below is from a run with 200K clients.
e. Environment: Linux Debian with go1.8.1 linux/amd64
1, 2.

Entering interactive mode (type "help" for commands)
(pprof) top
478.40MB of 482.48MB total (99.15%)
Dropped 88 nodes (cum <= 2.41MB)
Showing top 10 nodes out of 35 (cum >= 8.51MB)
     flat  flat%   sum%        cum   cum%
 321.64MB 66.66% 66.66%   327.64MB 67.91%  github.com/gorilla/websocket.newConnBRW
  71.03MB 14.72% 81.38%    71.03MB 14.72%  runtime.malg
  26.50MB  5.49% 86.88%   101.25MB 20.99%  runtime.systemstack
  15.50MB  3.21% 90.09%    15.50MB  3.21%  syscall.anyToSockaddr
     14MB  2.90% 92.99%       43MB  8.91%  net.(*netFD).accept
  13.50MB  2.80% 95.79%    13.50MB  2.80%  net.sockaddrToTCP
   5.01MB  1.04% 96.83%     5.01MB  1.04%  bytes.makeSlice
      4MB  0.83% 97.66%        4MB  0.83%  github.com/gorilla/websocket.(*Conn).SetPingHandler
   3.72MB  0.77% 98.43%     3.72MB  0.77%  runtime.allgadd
   3.50MB  0.73% 99.15%     8.51MB  1.76%  io/ioutil.readAll


Entering interactive mode (type "help" for commands)
(pprof) top
37.28s of 69.67s total (53.51%)
Dropped 350 nodes (cum <= 0.35s)
Showing top 10 nodes out of 167 (cum >= 1.84s)
      flat  flat%   sum%        cum   cum%
    13.62s 19.55% 19.55%     14.80s 21.24%  syscall.Syscall
     7.40s 10.62% 30.17%      7.40s 10.62%  runtime.futex
     4.31s  6.19% 36.36%      4.31s  6.19%  runtime._ExternalCode
     3.05s  4.38% 40.73%      3.05s  4.38%  runtime.epollwait
     2.23s  3.20% 43.94%      8.77s 12.59%  runtime.gentraceback
     1.45s  2.08% 46.02%      1.47s  2.11%  runtime.greyobject
     1.43s  2.05% 48.07%      3.59s  5.15%  runtime.pcvalue
     1.37s  1.97% 50.04%      8.72s 12.52%  runtime.mallocgc
     1.23s  1.77% 51.80%      2.24s  3.22%  runtime.scanobject
     1.19s  1.71% 53.51%      1.84s  2.64%  runtime.lock
(pprof)
var upgrader = websocket.Upgrader{
	EnableCompression: false,
	ReadBufferSize:    512,
	WriteBufferSize:   512,
	CheckOrigin: func(r *http.Request) bool {
		return true // accept connections from any origin
	},
	HandshakeTimeout: 5 * time.Minute,
}
func echoAsync(w http.ResponseWriter, r *http.Request) {
	socket, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Print("upgrade:", err)
		return
	}
	go connectionHandler(socket)
}

func connectionHandler(socket *websocket.Conn) {
	defer func() {
		socket.Close()
		atomic.AddInt64(&connected, -1)
	}()
	atomic.AddInt64(&connected, 1)
	for {

		mt, message, err := socket.ReadMessage()
		if err != nil {
			log.Println("read:", err)
			break
		}
		//log.Printf("recv: %s", message)
		err = socket.WriteMessage(mt, message)
		if err != nil {
			log.Println("write:", err)
			break
		}
		atomic.AddInt64(&writes, 1)
	}
}

[screenshot of the server's console output]

As you can see here, Go handles 200K concurrent connections (connected=#), and writes shows the RPS. The numbers here are equal to Node.js, for example, but I think Go can do better, with less overhead and much more speed.

As I said before, I'm new to Go, and my question is not about which language to choose; it is about what I can do better to improve my code and get an accurate picture of the power of Go. In the image another message is printed:

read: websocket: close 1006 (abnormal closure): unexpected EOF

This message is printed when the server is stressed.

Another point to understand is unbounded concurrency and how it relates to my test. For example, in Node.js I use a batch to handle N requests from clients and achieve better results.

Thanks.

P.S.:
mem.svg
https://cdn.rawgit.com/marianogenovese/d75ccd83e94dde2629d72e88bf7e5538/raw/f600d8896266b8a6eeec1d1aff041c1af3516830/mem.svg

cpu.svg
https://cdn.rawgit.com/marianogenovese/e15bf771575e3ceeb597f343b1ba8265/raw/4844be2605e00537e78b1bb8ef199081244f3fde/cpu.svg

@garyburd
Contributor

garyburd commented Jun 20, 2017

@marianogenovese The Upgrade method allocated 1790 bytes per connection. That seems reasonable.

The close 1006 (abnormal closure): unexpected EOF error indicates that the peer closed the connection. What is the peer?

@marianogenovese

Hi @garyburd, I have a .NET Core client that generates load against my server. I have increased the connection timeout on my client and thus have fewer 1006 errors. This works, but the CPU usage is still high: more connections, more CPU, more contention.

@marianogenovese

marianogenovese commented Jun 21, 2017

In my previous comments my gorilla web server created a goroutine per connection; I have changed my server so that it does not start a goroutine, like this:

func echoAsync2(w http.ResponseWriter, r *http.Request) {
	socket, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Print("upgrade:", err)
		return
	}
	defer func() {
		fmt.Printf("Connection end\n")
		socket.Close()
		atomic.AddInt64(&connected, -1)
	}()

	atomic.AddInt64(&connected, 1)
	socket.SetReadLimit(100)
	for {
		mt, message, err := socket.ReadMessage()
		if err != nil {
			log.Println("read:", err)
			break
		}

		err = socket.WriteMessage(mt, message)
		if err != nil {
			log.Println("write:", err)
			break
		}

		atomic.AddInt64(&writes, 1)
		runtime.Gosched()
	}
}

Memory now, with 250K connections, has increased compared with the previous profiles:

Entering interactive mode (type "help" for commands)
(pprof) top
2944.37MB of 3167.97MB total (92.94%)
Dropped 73 nodes (cum <= 15.84MB)
Showing top 10 nodes out of 38 (cum >= 47.50MB)
      flat  flat%   sum%        cum   cum%
 1024.92MB 32.35% 32.35%  1024.92MB 32.35%  net/http.newBufioReader
  961.68MB 30.36% 62.71%   961.68MB 30.36%  net/http.newBufioWriterSize
  392.67MB 12.39% 75.10%   401.17MB 12.66%  github.com/gorilla/websocket.newConnBRW
  119.03MB  3.76% 78.86%   119.03MB  3.76%  runtime.mapassign
  100.54MB  3.17% 82.03%   100.54MB  3.17%  runtime.malg
   91.51MB  2.89% 84.92%   138.51MB  4.37%  context.WithCancel
   88.02MB  2.78% 87.70%   456.08MB 14.40%  net/http.(*conn).readRequest
   63.52MB  2.00% 89.71%   267.55MB  8.45%  net/http.readRequest
      55MB  1.74% 91.44%   160.12MB  5.05%  runtime.systemstack
   47.50MB  1.50% 92.94%    47.50MB  1.50%  runtime.rawstringtmp
(pprof)
Entering interactive mode (type "help" for commands)
(pprof) top
707.73s of 1124.61s total (62.93%)
Dropped 450 nodes (cum <= 5.62s)
Showing top 10 nodes out of 123 (cum >= 13.04s)
      flat  flat%   sum%        cum   cum%
   289.16s 25.71% 25.71%    309.83s 27.55%  syscall.Syscall
   150.36s 13.37% 39.08%    150.36s 13.37%  runtime.futex
    79.04s  7.03% 46.11%     79.04s  7.03%  runtime._ExternalCode
    70.83s  6.30% 52.41%     70.83s  6.30%  runtime.epollwait
    31.30s  2.78% 55.19%     60.17s  5.35%  runtime.lock
    22.27s  1.98% 57.17%     26.04s  2.32%  runtime.deferreturn
    18.38s  1.63% 58.81%    106.33s  9.45%  runtime.mallocgc
    18.12s  1.61% 60.42%     18.25s  1.62%  runtime.casgstatus
    15.39s  1.37% 61.79%       283s 25.16%  runtime.findrunnable
    12.88s  1.15% 62.93%     13.04s  1.16%  runtime.greyobject

-alloc_objects

Entering interactive mode (type "help" for commands)
(pprof) top
189200968 of 197963140 total (95.57%)
Dropped 75 nodes (cum <= 989815)
Showing top 10 nodes out of 36 (cum >= 1220162)
      flat  flat%   sum%        cum   cum%
  57475510 29.03% 29.03%   73021531 36.89%  github.com/gorilla/websocket.(*Conn).NextReader
  54913204 27.74% 56.77%  109414124 55.27%  io/ioutil.readAll
  54500923 27.53% 84.30%   54500923 27.53%  bytes.makeSlice
   8001341  4.04% 88.35%    8001341  4.04%  runtime.makechan
   3874437  1.96% 90.30%   15546021  7.85%  github.com/gorilla/websocket.(*Conn).WriteControl
   3670240  1.85% 92.16%   11671584  5.90%  time.NewTimer
   2801719  1.42% 93.57%    2801719  1.42%  runtime.rawstringtmp
   1706756  0.86% 94.43%    2216526  1.12%  context.WithCancel
   1290127  0.65% 95.09%    1847191  0.93%  github.com/gorilla/websocket.newConnBRW
    966711  0.49% 95.57%    1220162  0.62%  runtime.systemstack

@garyburd
Contributor

The first example starts a new goroutine and returns from the goroutine started by the HTTP server. This allows GC of the per-connection data created by the HTTP server.

The second example executes on the goroutine started by the HTTP server, so that per-connection data stays reachable for the life of the connection (note net/http.newBufioReader and net/http.newBufioWriterSize at the top of your latest memory profile).

There is one goroutine serving the connection in both examples.

@marianogenovese

marianogenovese commented Jun 21, 2017

OK, I understand, but I still do not understand the difference between the profiles.

@garyburd
Contributor

@marianogenovese What are the specific differences you are concerned about?

@gorilla gorilla locked and limited conversation to collaborators Feb 14, 2018