Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENABLE_GZIP activated will result in tar.gz & tgz files compressed 2 times in gzip #30649

Closed
Daryes opened this issue Apr 22, 2024 · 15 comments
Closed
Assignees
Labels
issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail issue/not-a-bug The reported issue is the intended behavior or the problem is not inside Gitea

Comments

@Daryes
Copy link

Daryes commented Apr 22, 2024

Description

Had to dig to understand what happened.
It started with retrieving a git repo in Gitea as an archive and having tar failing on it complaining it is not a tar archive.
Thing is, it was a gzip archive. Then after a gunzip, the .tar file was in fact a ... gzip file too.

Here are the shell commands to demonstrate this, but you can also use 7zip in a file browser, it will show an extra level with the archive name and no extension before having the first directory level.

$ wget https://gitea..../repo-master.tar.gz

$ tar xzvf repo-master.tar.gz
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

$ file repo-master.tar.gz
repo-master.tar.gz: gzip compressed data, original size modulo 2^32 1008078

$ gunzip repo-master.tar.gz
$ ls
	repo-master.tar
$ file repo-master.tar
repo-master.tar: gzip compressed data, from Unix, original size modulo 2^32 6912000

$ tar xzvf repo-master.tar
(...)
Data uncompressed correctly

It went under the radar for a while as I rarely use this mechanism with the Gitea instance.

While digging in the configuration I found that I had the parameter ENABLE_GZIP = true
There is no additional parameter related to tar.*.command as stated in #26620 . Gitea has the default behavior here.

As sometimes there can be a problem due to the reverse proxy nginx in front wich also has gzip=on, I did some tests without it.
It was completely bypassed, connected directly to the gitea instance/port.

Results are :

  • zip files are correct in all situations
  • with ENABLE_GZIP = true, the downloaded tgz/tar.gz files are compressed in gzip a second time.
    It is worth noting the files generated by Gitea in its cache under repo-archive/*hash*.tar.gz are valid.
  • with ENABLE_GZIP = false the downloaded tgz/tar.gz files are valid.
    Included those who were already generated and still in Gitea local cache.

I also think ENABLE_GZIP does not do what is stated in the documentation : when set to false, without a reverse proxy most of the data was still sent gzipped. What is the real use for ?

Gitea Version

1.21.9 and 1.21.11

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.17.1

Operating System

Ubuntu 18.04

How are you running Gitea?

Binary release from https://github.com/go-gitea/gitea/releases
Launched by systemd

Database

SQLite

@wxiaoguang
Copy link
Contributor

  • with ENABLE_GZIP = true, the downloaded tgz/tar.gz files are compressed in gzip a second time.
    It is worth noting the files generated by Gitea in its cache under repo-archive/*hash*.tar.gz are valid.

If it is related to ENABLE_GZIP=true, could you try the 1.21 nightly (https://dl.gitea.com/gitea/1.21/) ? It has a new gzip handler.

I also think ENABLE_GZIP does not do what is stated in the documentation : when set to false, without a reverse proxy most of the data was still sent gzipped. What is the real use for ?

It doesn't seem to be possible. At least, according to the code, the "gzip handler" is only used when ENABLE_GZIP=true.

Could you provide more details about which page is sent as gzip when ENABLE_GZIP=false?

gitea/routers/web/web.go

Lines 246 to 254 in fcdc57d

if setting.EnableGzip {
// random jitter is recommended by: https://pkg.go.dev/github.com/klauspost/compress/gzhttp#readme-breach-mitigation
// compression level 6 is the gzip default and a good general tradeoff between speed, CPU usage, and compression
wrapper, err := gzhttp.NewWrapper(gzhttp.RandomJitter(32, 0, false), gzhttp.MinSize(GzipMinSize), gzhttp.CompressionLevel(6))
if err != nil {
log.Fatal("gzhttp.NewWrapper failed: %v", err)
}
mid = append(mid, wrapper)
}

@Daryes
Copy link
Author

Daryes commented Apr 24, 2024

Hello,

If it is related to ENABLE_GZIP=true, could you try the 1.21 nightly (https://dl.gitea.com/gitea/1.21/) ? It has a new gzip handler.

Tested with this binary : gitea-1.21-linux-amd64

  • time: 2024-04-24T00:34:10.000Z
  • version : 1.21.11+16-gddf64b84e4

Same behavior, no change : With ENABLE_GZIP,=true the .tar.gz files are compressed a second time.

It doesn't seem to be possible. At least, according to the code, the "gzip handler" is only used when ENABLE_GZIP=true.

Could you provide more details about which page is sent as gzip when ENABLE_GZIP=false?

Well, then, the plot thickens ...
Reran some tests, no reverse proxy, http connection, cache disabled in the browser, and "ENABLE_GZIP=false" in app.ini.

The following files are always gziped : webcomponents.js, index.js, index.css, logo.svg (the gitea one).
On the other hand, index.html page and anything .json related are not compressed, as expected.

The http request/response for index.css with ENABLE_GZIP=false

Http request :
GET		http://********:3000/assets/css/index.css?v=1.21.11
Status		200 OK
Version		HTTP/1.1
Transferred	87.40 kB (551.10 kB size)
Referrer Policy	no-referrer

Response headers (raw) :
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: private, max-age=21600
Content-Encoding: gzip
Content-Type: text/css; charset=utf-8
Last-Modified: Tue, 16 Apr 2024 03:46:00 GMT
Date: Wed, 24 Apr 2024 19:14:32 GMT
Transfer-Encoding: chunked

With ENABLE_GZIP=true, everything is compressed, images aside.
The problem with the .tar.gz files is helpful to validate the setting value.

Also, this might be the source of the problem for the double compression of the .tar.gz files : some mimetype exclusions could be missing in the code.

This aside, static resources or heavy files with an enforced compression do not strike me as a wrong idea. Just, yeah, seems some part of the documentation and code need a pass or 2.

@wxiaoguang
Copy link
Contributor

Thank you for the details. I will try to answer the questions by my understanding.

For the ENABLE_GZIP option itself:

Gitea has 2 different web request handlers.

  • One is for asset files (the static files you mentioned). These files are packed into Gitea's binary with pre-compression by vfsgen package. So they are not controlled by ENABLE_GZIP option, as long as the user's browser supports compress, these files are served as gzip content
  • Another kind of web request handlers is for "dynamic contents", like home page, repo page, etc, including the archive/attachment downloading, they could be affected by ENABLE_GZIP option

For the "tar.gz" problem itself, I did a quick test by these steps:

Use the docker compose:

version: "3"

services:
  server:
    image: gitea/gitea:1.21
    container_name: gitea
    environment:
      - GITEA__server__ENABLE_GZIP=true
    volumes:
      - ./gitea:/data
    ports:
      - "3002:3000"
      - "3022:22"

Test the gzip is really enabled:

$ curl -H 'Accept-encoding: gzip' -v http://localhost:3002/
...
< Content-Encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell
...

Create a repo named "test-archive", then download the tar.gz archive:

$ wget http://localhost:3002/root/test-archive/archive/main.tar.gz
.....
$ tar -xvf main.tar.gz
x test-archive/
x test-archive/README.md
$ cat test-archive/README.md
.....
.....

I think it works well.

@Daryes
Copy link
Author

Daryes commented Apr 25, 2024

I can confirm your test for a new repo, with a default readme.md : the tar.gz file is correct.
be it from wget or using a browser.

I can also confirm this behavior is not the norm. The trigger is having the repo to a minimum size.
There's also the fact I was running wget with some environment parameters activating gzip.

With your new repo test and ENABLE_GZIP=true :

$ git clone gitea@***:test/testing.git
$ cd testing

$ du -sh *
4.0K    README.md

$  wget https://raw.githubusercontent.com/go-gitea/gitea/main/README.md
(...)
$ du -sh *
4.0K    README.md
12K     README.md.1
$ git add .
$ git commit -m "test 1" && git push

# --------------------------------------------
$ wget -S http://***/test/testing/archive/master.tar.gz
  (...)
  Content-Disposition: attachment; filename="testing-master.tar.gz"; filename*=UTF-8''testing-master.tar.gz
  Last-Modified: Thu, 25 Apr 2024 21:25:37 GMT
  (...)
Length: 3251 (3.2K) [application/octet-stream]
Saving to: ‘master.tar.gz’

$ file master.tar.gz
master.tar.gz: gzip compressed data, from Unix
$ tar tvf master.tar.gz
file is ok
# --------------------------------------------
$ wget -S --header="accept-encoding: gzip" http://***/test/testing/archive/master.tar.gz
  (...)
  Content-Disposition: attachment; filename="testing-master.tar.gz"; filename*=UTF-8''testing-master.tar.gz
  Content-Encoding: gzip
  Last-Modified: Thu, 25 Apr 2024 21:25:37 GMT
  (...)
Length: unspecified [application/octet-stream]
Saving to: ‘master.tar.gz.1’

$ file master.tar.gz.1
master.tar.gz.1: gzip compressed data
$ tar tvf master.tar.gz.1
tar: This does not look like a tar archive

Please note the response headers are also not the same between gzip encoding activated or not
And using a browser will also return a double gzipped archive, due to the fact it will by default accepted gzip encoded data.

So to summarize :

  • ENABLE_GZIP=true set in Gitea configuration

  • the target repo must have a minimal size to trigger Gitea gzip encoder on http connections. The default readme.md from a new repo is clearly not enough.

  • wget in its default configuration will not support gzip encoded data and will not trigger the double compressed archive.

  • any client (browser, curl, wget) with a configuration where gzip encoding is explicitly accepted will trigger the double compressed archive

@wxiaoguang
Copy link
Contributor

Oops, my bad .... I forgot the minimal gzip threshold 😢 you are right, I will take a look again later.

@wxiaoguang wxiaoguang self-assigned this Apr 25, 2024
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Apr 26, 2024

Tested again with my docker compose in #30649 (comment) .

This time, I uploaded a large avatar.png to the repo (170KB)

$ curl -H 'Accept-encoding: gzip' -v http://localhost:3002/
...
> Accept-encoding: gzip
>
< HTTP/1.1 200 OK
...
< Content-Encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell

$ wget  -S --header="accept-encoding: gzip"  http://localhost:3002/root/test-archive/archive/main.tar.gz
--2024-04-26 08:07:06--  http://localhost:3002/root/test-archive/archive/main.tar.gz
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:3002... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Accept-Ranges: bytes
  Access-Control-Expose-Headers: Content-Disposition
  Cache-Control: private, max-age=300
  Content-Disposition: attachment; filename="test-archive-main.tar.gz"; filename*=UTF-8''test-archive-main.tar.gz
  Content-Encoding: gzip
  Content-Type: application/octet-stream
  Last-Modified: Fri, 26 Apr 2024 00:07:07 GMT
  (... set cookie ...)
  Vary: Accept-Encoding
  X-Content-Type-Options: nosniff
  X-Frame-Options: SAMEORIGIN
  Date: Fri, 26 Apr 2024 00:07:07 GMT
  Transfer-Encoding: chunked
Length: unspecified [application/octet-stream]
Saving to: ‘main.tar.gz’

$ ls -lh main.tar.gz
-rw-r--r-- 1 xiaoguang staff 167K Apr 26 08:07 main.tar.gz

$ tar -xvf main.tar.gz
x test-archive/
x test-archive/README.md
x test-archive/avatar.png

$ gunzip main.tar.gz
$ mv main.tar main.tar.gz && gunzip main.tar.gz
(now get the original tar ....)

It seems that my tar is too smart, it is able to un-archive the double-gzipped file .......

Outdated, see below.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Apr 26, 2024

After more testing, I think the problem is clear now. I think the gzip compression is not wrong. There are some cases:

  1. Download by browser (or curl with transparent compression), browser will respect the Content-Encoding: gzip header, and decompress the received content automatically, and write the decompressed data into file.
    • Download "zip": the transferred data is compressed by gzip, and browser decompress the gzip content, write the original zip file
    • Download "tar.gz": the transferred data is compressed by gzip (again), and browser decompress the transferred gzip content, write the original "tar.gz" file content
  2. Download by wget, since you have set --header="accept-encoding: gzip manually, wget won't decompress the transferred content automatically. It only writes what it received into file.
    • Download "zip": the transferred data is compressed by gzip, wget writes the received gzip content into file
    • Download "tar.gz", the transferred data is compressed by gzip (again), wget writes the received double-gzipped content into file
$ file test-repo-main.zip (downloaded by Chrome)
test-repo-main.zip: Zip archive data, ....

$ file main.zip (downloaded by wget --header="accept-encoding: gzip")
main.zip: gzip compressed data, ....

About this:

any client (browser, curl, wget) with a configuration where gzip encoding is explicitly accepted will trigger the double compressed archive

That's expected and correct behavior. If you would like to make "curl" work with "gzip" transparently, it should use curl --compressed. https://stackoverflow.com/questions/18983719/is-there-any-way-to-get-curl-to-decompress-a-response-without-sending-the-accept

See the differences by:

$ curl -v -H "accept-encoding: gzip" -O http://localhost:3002/root/test-archive/archive/main.zip
(get a gzipped zip file)

$ curl -v --compressed -O http://localhost:3002/root/test-archive/archive/main.zip
(get a correct zip file, just like browsers)

@wxiaoguang wxiaoguang added issue/not-a-bug The reported issue is the intended behavior or the problem is not inside Gitea issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail and removed type/bug labels Apr 26, 2024
@wxiaoguang
Copy link
Contributor

So I think this issue could be closed?

@Daryes
Copy link
Author

Daryes commented Apr 30, 2024

No.
Not for me anyway.

I've said it before, the browser is also not handling the tar.gz properly. As its true I didn't specify the version, here it is : Firefox ESR 115.10 (latest).
Btw, I tend to favor tar.gz files from repos as they will manage correctly symlinks and other *nix special files that zip sometimes cannot, and will not warn about.

This said, with ENABLE_GZIP on, currently, wget and firefox do not save a valid tar.gz file.
One client in error, could be it, 2 different clients ? Might be server side instead.
Never before I had to specify --compress explicitly in either curl or wget to download an archive from a webserver. If the content is wrong, it is then either badly handled by the webserver, or sent without the correct information.

What could be the source of the problem seems to circle around the response header Content-Type sent by Gitea.
It is Content-Type: application/octet-stream for zip AND tar.gz archives.
Which is wrong for both if they must be managed by the client.
While it is usually used to force the client to save the data in a file as per the RFC 2046, given it is an arbitrary data, it leaves to the client how it will be handled.
Now, combine those 3 headers sent to the client

Content-Disposition: attachment; filename="testing-master.tar.gz"; filename*=UTF-8''testing-master.tar.gz
Content-Encoding: gzip
Content-Type: application/octet-stream

How it will be handled will depend of the client, as the type octet-stream has no real mandatory definition about how to managed it, aside being saved as a file (mind the fact this is also just a recommendation originally in said RFC).

Either the client is blind and applies the content-encoding, stating gzip => uncompress the content.
Or it tries to be smart due to so much bad implementation or configuration since decades, and with an arbitrary data type, will seek the filename information in the content-disposition header.

  • Filename is zip ? encoding is gzip => uncompress then save
  • Filename is gz ? encoding is gzip => save as is, because it is an octet stream. Which can also be interpreted as raw data.

This scenario would explain the situation.
It could be validated with a proxy and force-rewriting the content-type header. Not hard, but already too much work for this.


I did much more digging on this than intended, for something that was in the end not correctly implemented since start, from my point of view.
I would like to remind that activating deflate or zip on a webserver is done using mimetype to ensure already compressed data (archive, images, crypted files, ...) is not compressed another time.
It is a waste of cpu cycles, and gzipping everything without regard to the data nor the size tends to give the opposite effect from what was desired.

The parameter ENABLE_GZIP should be either set to false as default (which is the current, so good already) but also masked/removed from the documentation, leaving the "full" compression to a reverse proxy like nginx or traefik.
Or keeping it but after a rewrite to manage exclusions, for properly compressing or leaving untouched the data depending of the real type.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented May 1, 2024

Thank you for more feedbacks. Let's do a quick test:

This test is still based on my provided docker compose, with a large file in it:

  1. wget -O test-no-gzip.zip http://localhost:3002/root/test-gzip/archive/main.zip
    • test-no-gzip.zip: Zip archive data, at least v1.0 to extract, compression method=store
  2. wget -O test-with-gzip.zip --header="accept-encoding: gzip" http://localhost:3002/root/test-gzip/archive/main.zip
    • test-with-gzip.zip: gzip compressed data, original size modulo 2^32 170400
    • If accept-encoding: gzip is set, then the saved content should be gzipped again, because client shouldn't transparently decompress the transferred content.
  3. Download by Firefox 125.0.2 (64-bit)
    • Downloads/test-gzip-main.zip: Zip archive data, at least v1.0 to extract, compression method=store
    • image
  4. Download the "tar.gz" by Firefox
    • gunzip test-gzip-main.tar.gz && file test-gzip-main.tar
    • test-gzip-main.tar: POSIX tar archive
    • It isn't double-gzipped

Conclusion:

  • wget/Firefox always run correctly for this case.
  • Manually setting "accept-encoding: gzip" is not the correct method in daily usage, it prevent the client from transparently handling the transport encoding.

About your question:

Content-Disposition: attachment; filename="testing-master.tar.gz"; filename*=UTF-8''testing-master.tar.gz
Content-Encoding: gzip
Content-Type: application/octet-stream

These headers are right (at least, not wrong). Content-Encoding is for transporting, it should be handled transparently by the clients. After the transport is correctly handled and client gets "processed content (decompressed)", then the client could use Content-Disposition to know to write the processed content into a file, and the content type is application/octet-stream (for downloading, the content type doesn't really matter)


I also agree that when transferring some compressed contents, using gzip to compress the data on wire again is not ideal (and unnecessary), while indeed it won't (and shouldn't) cause any real problem at the moment. I think the behavior could be improved if the 3rd gzip handler could be more smarter when it detects if the content is compressed .......... and maybe it could also hard-code something on Gitea side.

@wxiaoguang
Copy link
Contributor

and maybe it could also hard-code something on Gitea side.

As a quick fix, it could be like this: Skip gzip for some well-known compressed file types #30796

@wxiaoguang
Copy link
Contributor

#30796 and its backport have been merged. Feel free to try 1.22-nightly

@wxiaoguang
Copy link
Contributor

Are there still any problems? Or maybe this issue could be closed?

@Daryes
Copy link
Author

Daryes commented May 6, 2024

I didn't test it yet, but I don't see why it wouldn't work given the changes you did.
Btw, I would like to thank you for your work, I didn't expect it would be resolved.

@Daryes
Copy link
Author

Daryes commented May 12, 2024

Just tested, working fine with already compressed files while retaining the gzip compression for the other files.
Thanks

@Daryes Daryes closed this as completed May 12, 2024
@go-gitea go-gitea locked as resolved and limited conversation to collaborators Aug 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail issue/not-a-bug The reported issue is the intended behavior or the problem is not inside Gitea
Projects
None yet
Development

No branches or pull requests

2 participants