Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use keywords for headers #1

Closed
deadtrickster opened this issue Oct 13, 2014 · 16 comments
Closed

Do not use keywords for headers #1

deadtrickster opened this issue Oct 13, 2014 · 16 comments

Comments

@deadtrickster
Copy link
Contributor

While issue not SBCL specific this code is:

(ql:quickload :fast-http)
(ql:quickload :cl-interpol)
(ql:quickload :trivial-utf-8)
(cl-interpol:enable-interpol-syntax)

(defun random-garbage (length)
           (let ((chars "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
                 (garbage (make-string length)))
             (dotimes (i length)
               (setf (aref garbage i) (aref chars (random (length chars)))))
             garbage))

(defun test-memory-leak (iterations)
           (loop for i from 0 to iterations do
             (let* ((request (list #?"POST / HTTP/1.1\r\n"
                              #?"Host: www.example.com\r\n"
                              #?"Content-Type${(random-garbage 1000)}: application/x-www-form-urlencoded\r\n"
                              #?"Content-Length: 4\r\n"
                              #?"Connection: close\r\n"
                              #?"\r\n"
                              #?"q=42\r\n"))
                    (http (fast-http:make-http-request :store-body t ))
                    (parser (fast-http:make-parser http :store-body t :header-callback (lambda (h) (declare (ignore h)) ))))
               (loop for req-chunk in request do              
                 (funcall parser (trivial-utf-8:string-to-utf-8-bytes req-chunk))))))

(define-alien-routine print-generation-stats void)
(sb-ext:gc :full t)
(print-generation-stats)

;; MAGIC
(test-memory-leak 200000) ;; maybe you should adapt this constant to your memory settings

Output

 Load 1 ASDF system:
    fast-http
; Loading "fast-http"
..
(:FAST-HTTP)
* To load "cl-interpol":
  Load 1 ASDF system:
    cl-interpol
; Loading "cl-interpol"
...
#<ASDF/SYSTEM:SYSTEM "cl-interpol"> 
(:CL-INTERPOL)
* To load "trivial-utf-8":
  Load 1 ASDF system:
    trivial-utf-8
; Loading "trivial-utf-8"

(:TRIVIAL-UTF-8)
* 
* 
RANDOM-GARBAGE
* 
TEST-MEMORY-LEAK
* 
PRINT-GENERATION-STATS
* 
NIL
*  Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     1     0     0     0     0        0 32768 10737418    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   2:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   3:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   4:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   5:     0     0     0     0   636   357    68   128     6 37547216 1413936 48284634  698   1  0.0000
   6:     0     0     0     0  1702   155     0     0     0 60850176     0  2000000 1608   0  0.0000
   Total bytes allocated    = 98397392
   Dynamic-space-size bytes = 1073741824

NIL
*
Heap exhausted during garbage collection: 128 bytes available, 480 requested.
 Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   2:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   3: 15219 23801     0     0  1350 13413   348    44     0 494612320 1986720 10737418    0   0  0.9943
   4:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   5:     0     0     0     0   636   357    68   128     6 37547216 1413936 48284634  694   1  0.0000
   6:     0     0     0     0  1702   155     0     0     0 60850176     0  2000000 1613   0  0.0000
   Total bytes allocated    = 1068602064
   Dynamic-space-size bytes = 1073741824
GC control variables:
   *GC-INHIBIT* = true
   *GC-PENDING* = true
   *STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 57429(tid 140737353971520):
Heap exhausted, game over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>

PS. funny thing - chunga and its friends do the same mistake.

@fukamachi
Copy link
Owner

Hmm, this must be serious.

But a property list with keyword keys is useful for high-level layer. How about just limiting the length of header fields?

FYI, fast-http provides low-level API like make-ll-parser. It provides only a byte vector and indices of the beginning and ending.

@deadtrickster
Copy link
Contributor Author

Let's say we set length limit to 20, then if I'm correct the number of possible strings with 20 chars is at least 4.239115828×10²⁸ (27^20, we are ignoring case so at least 26 alphabet characters plus #'-) or at least 8.478231655×10²⁹ bytes. Then again if I'm correct we can continue adding with strings with 19 chars and so on.

Taking this into account I don't think symbols here are actually usable, what we're using internally (since people really love keywords :-) is https://github.com/deadtrickster/ia-hash-table based on the idea stolen from RoR.

As for low-level things, I glanced through code and it looks like you're using babel with default encoding (for me it's utf-8). I observed that trivial-utf-8 is slightly faster:

(time (loop for i from 0 to 1000000 do
           (babel:octets-to-string #(113 119 101 113 119 101 113 119 101 113 119 101 113 119 114 113 119 101 113
  119 101 116 113 116 119 101 116 113 119 101 116 113 119 101 116))))
Evaluation took:
  0.724 seconds of real time
  1.396658 seconds of total run time (1.269276 user, 0.127382 system)
  [ Run times consist of 0.056 seconds GC time, and 1.341 seconds non-GC time. ]
  192.96% CPU
  1,933,973,114 processor cycles
  235,482,272 bytes consed
(time (loop for i from 0 to 1000000 do
           (trivial-utf-8:utf-8-bytes-to-string #(113 119 101 113 119 101 113 119 101 113 119 101 113 119 114 113 119 101 113
  119 101 116 113 116 119 101 116 113 119 101 116 113 119 101 116))))
Evaluation took:
  0.550 seconds of real time
  1.061923 seconds of total run time (0.974123 user, 0.087800 system)
  [ Run times consist of 0.039 seconds GC time, and 1.023 seconds non-GC time. ]
  193.09% CPU
  1,468,940,292 processor cycles
  230,651,344 bytes consed

Maybe, UTF-8 is overkill for header fields, urls?

@fukamachi
Copy link
Owner

20 chars limit is too strict. I choose 30 for the default and allow to change it by a special variable.

@deadtrickster
Copy link
Contributor Author

+http-max-header-size+ is 8kb (and it looks like it is common thing) so limit for header field name is just useless.

(ql:quickload :fast-http)
(ql:quickload :cl-interpol)
(ql:quickload :trivial-utf-8)
(cl-interpol:enable-interpol-syntax)

(defun random-garbage (length)
  (let ((chars "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-")
        (garbage (make-string length)))
    (dotimes (i length)
      (setf (aref garbage i) (aref chars (random (length chars)))))
    garbage))

(defun test-memory-leak (iterations)
  (loop for i from 1 to iterations do
        (let* ((request (list #?"POST / HTTP/1.1\r\n"
                              #?"Host: www.example.com\r\n"
                              #?"${(random-garbage 30)}: application/x-www-form-urlencoded\r\n"
                              #?"${(random-garbage 30)}: garbage\r\n"
                              #?"${(random-garbage 30)}: w-form-urlencoded\r\n"
                              #?"${(random-garbage 30)}: application/xurlencoded\r\n"
                              #?"${(random-garbage 30)}: applicatiurlencoded\r\n"
                              #?"${(random-garbage 30)}: application/x-wwwm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: applicatwww-form-urlencoded\r\n"
                              #?"${(random-garbage 30)}: applicaform-urlencoded\r\n"
                              #?"${(random-garbage 30)}: applicationw-form-urlencoded\r\n"
                              #?"${(random-garbage 30)}: appltiorm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: appliorm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: appatiorm-lencoded\r\n"
                              #?"${(random-garbage 30)}: applitiorm-lencoded\r\n"
                              #?"${(random-garbage 30)}: applicatiorm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: applicatiorm-\r\n"
                              #?"${(random-garbage 30)}: applitiorm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: appliorm-urlencoded\r\n"
                              #?"${(random-garbage 30)}: applicrm-urlncoded\r\n"
                              #?"${(random-garbage 30)}: applicatiorm-\r\n"
                              #?"Content-Length: 4\r\n"
                              #?"Connection: close\r\n"
                              #?"\r\n"
                              #?"q=42\r\n"))
               (http (fast-http:make-http-request :store-body t ))
               (parser (fast-http:make-parser http :store-body t :header-callback (lambda (h) (declare (ignore h)) ))))
          (loop for req-chunk in request do
                (funcall parser (trivial-utf-8:string-to-utf-8-bytes req-chunk))))
        (if (= 0 (mod i 10000))
            (format t "~a Iterations~%"  i))))

(define-alien-routine print-generation-stats void)
(sb-ext:gc :full t)
(print-generation-stats)

;; MAGIC
(test-memory-leak 20000000) ;; maybe you should adapt this constant to your memory settings
*  Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     1     0     0     0     0        0 32768 10737418    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   2:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   3:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   4:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   5:     0     0     0     0   766   390    68   128     6 42824000 1478336 53561418  828   1  0.0000
   6:     0     0     0     0  1702   155     0     0     0 60850176     0  2000000 1608   0  0.0000
   Total bytes allocated    = 103674176
   Dynamic-space-size bytes = 1073741824

NIL
* (test-memory-leak 20000000)
10000 Iterations
20000 Iterations
30000 Iterations
40000 Iterations
50000 Iterations
60000 Iterations
70000 Iterations
80000 Iterations
90000 Iterations
100000 Iterations
110000 Iterations
120000 Iterations
130000 Iterations
140000 Iterations
150000 Iterations
160000 Iterations
170000 Iterations
180000 Iterations
Heap exhausted during garbage collection: 80 bytes available, 144 requested.
 Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   2:  9535 12985     0     0  1682  5025     0     0     0 219178832 596144 239756250    0   1  1.0449
   3: 26391 32767     0     0  5296 13910  3240   406    23 747243600 1570736 10737418 4063   0  1.0775
   4:     0     0     0     0     0     0     0     0     0        0     0 10737418    0   0  0.0000
   5:     0     0     0     0   766   390    68   128     6 42824000 1478336 53561418  823   1  0.0000
   6:     0     0     0     0  1702   155     0     0     0 60850176     0  2000000 1612   0  0.0000
   Total bytes allocated    = 1070096608
   Dynamic-space-size bytes = 1073741824
GC control variables:
   *GC-INHIBIT* = true
   *GC-PENDING* = true
   *STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 55548(tid 140737353971520):
Heap exhausted, game over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> 

less than 190000 iterations!
In other word one can just increase header fields count instead of singe header field name length.
Just don't use keywords!

@fukamachi
Copy link
Owner

You're right.
I couldn't make a right decision because it's too serious and affects many web stuffs like Clack that uses a property list as a representation of HTTP headers.

I think we should just stop using keywords in every other projects, Clack, Caveman2, ningle, clack-errors, and maybe hermetic?
Also Wookie uses a plist for headers. CC @orthecreedence

@fukamachi
Copy link
Owner

Summary

  • Interning HTTP header fields to keywords exhausts memory easily.
  • fast-http's make-parser would be better to stop interning HTTP header fields.
    • This will be a new API difference from http-parse by orthecreedence.
  • Other projects using keywords for HTTP header field would be better to change the behaviour too.

@orthecreedence
Copy link

I'm fine with updating this in Wookie once moving to fast HTTP (thanks for the heads up) as well as making a general announcement about the change.

Suggestion: can we lowercase the header keys in the hash table? Seems to work well for the Node folks, and I don't think I've ever seen a case where both "Content-Type" and "content-type" are together in the same request.

fukamachi added a commit that referenced this issue Oct 17, 2014
* The all keys are downcased.
* If there're multiple header line which has the same key, the values will be combined with ', ' (RFC 2616)
@fukamachi
Copy link
Owner

Okay, I changed the data type of HTTP headers to hash-table in the above commit.

The all keys are downcased as orthecreedence said. If there're multiple header lines which have the same key, the values will be combined with ', '. (RFC 2616)

@fukamachi
Copy link
Owner

I forgot to say, thank you @deadtrickster for your kind warning by showing clear examples.

@deadtrickster
Copy link
Contributor Author

Fukamachi, thank you for fix, however :-)
Consider this:

  • Many people (including me) would like to write X-Content-Type-Options instead of x-content-type-options
  • (gethash "length" headers) /= (gethash "Length" headers) which is actually very strange - by RFC header field names are case insensitive.

Of course we can force users with lowercase strings, but maybe there something we can do about?

(we can play with test and hash functions, test function obviously will be string-equal, and hash function can be probably stolen from say sbcl [%sxhash-substring] and modified to ignore case)

@fukamachi
Copy link
Owner

Interesting point.

What do you think about adding a special variable or a keyword parameter to specify the case of header fields, like :downcase (default), :upcase or :capitalize?

(make-parser http
             :header-callback (lambda (headers) ...)
             :body-callback (lambda (bytes) ...)
             :header-fields-case :capitalize)

Or making it possible to take a hash-table test function would be another option. I'm still thinking.

@deadtrickster
Copy link
Contributor Author

How :header-fields-case will work?
If it's just specifies how names are converted from byte arrays it's still probably not so good. E.g. if I'm using http client which set :header-fields-case parameter under the hood to :upcase then if I want to inspect headers by myself I'm forced to write my strings in upper case too.

I'm voting for hash-table solution

@orthecreedence
Copy link

Are custom hash test functions portable (as in, will support most CL implementations)? I've had bad luck, but admittedly didn't dive very deep. If they are portable, then I'm in favor of using a case-insensitive hash test for headers, otherwise specifying :lower (strongly encourage this as the default since most people are going to be typing lowercase by default), :upper, or :passthru (or something aptly named that says "use the same case the client handed me").

@deadtrickster
Copy link
Contributor Author

Maybe custom-hash-table source can help?

If it can't then :downcase or :lower as default and (probably) single option is ok. HTTP parser provides headers for API writers AND end-users e.g. HTTP parser -> HTTP server -> application developer. Either http server author and application developer can write names in their preferred case [and I don't see how it possible without hash-tables or on the fly case conversion which is bad] or they just use what http parser returns mandatory.

@dydra
Copy link

dydra commented Oct 31, 2014

on this topic, it could be worthwhile to look at the way cl-http manages headers. john mallory's concern about symbols as an attack vector was the motivation for cl-xml to allow for dynamically bound token tables an as alternative to keywords for generic identifiers.
my recollection is that cl-http had as a goal 0-consing header processing and to that end, did two things. first headers were available in a dynamic context only. second, know headers were all predefined and, as such their identifiers could be keywords while the others were resourced.
it might be worth reviewing.

@eugeneia
Copy link

eugeneia commented Jan 6, 2017

Out of curiosity, why were association lists not considered, or if they were, how is the overhead of a hash table justified?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants