application/octet-stream data detected as "text/plain; charset=utf-8" since v1.3.0 #186

Closed
anthonyfok opened this issue Sep 13, 2021 · 1 comment

@anthonyfok

Attach the file for which the detection is inaccurate
Please see the attached test-file-from-shared_test_go_in_github-cli.txt file, extracted from https://github.com/cli/cli/blob/09b09810dd812e3ede54b59ad9d6912b946ac6c5/pkg/cmd/gist/shared/shared_test.go#L54-L87
(I wanted to use the .bin extension for binary data, but GitHub won't let me...)

Expected MIME type
application/octet-stream
(as was detected by v1.2.0 and before)

Returned MIME type
text/plain; charset=utf-8

Version of the library you are using
v1.3.1 (and v1.3.0)

Output of go version
go version go1.16.8 linux/amd64

Additional context

While packaging github.com/cli/cli (GitHub CLI gh), of which github.com/gabriel-vasile/mimetype is a dependency, I ran into the following error:

=== RUN   TestIsBinaryContents
    shared_test.go:85: 
                Error Trace:    shared_test.go:85
                Error:          Not equal: 
                                expected: true
                                actual  : false
                Test:           TestIsBinaryContents
--- FAIL: TestIsBinaryContents (0.00s)

It turns out that the go.mod of gh currently has github.com/gabriel-vasile/mimetype pinned at v1.1.2, while I packaged v1.3.1 for Debian: https://ftp-master.debian.org/new/golang-github-gabriel-vasile-mimetype_1.3.1-1.html

I originally tested with the following test main.go:

package main

import (
	"fmt"

	"github.com/gabriel-vasile/mimetype"
)

var fileContent = []byte{239, 191, 189, 239, 191, 189, 239, 191, 189, 239,
	191, 189, 239, 191, 189, 16, 74, 70, 73, 70, 239, 191, 189, 1, 1, 1,
	1, 44, 1, 44, 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191,
	189, 239, 191, 189, 67, 239, 191, 189, 8, 6, 6, 7, 6, 5, 8, 7, 7, 7,
	9, 9, 8, 10, 12, 20, 10, 12, 11, 11, 12, 25, 18, 19, 15, 20, 29, 26,
	31, 30, 29, 26, 28, 28, 32, 36, 46, 39, 32, 34, 44, 35, 28, 28, 40,
	55, 41, 44, 48, 49, 52, 52, 52, 31, 39, 57, 61, 56, 50, 60, 46, 51,
	52, 50, 239, 191, 189, 239, 191, 189, 239, 191, 189, 67, 1, 9, 9, 9, 12}

func main() {
	mtype := mimetype.Detect(fileContent)
	fmt.Println(mtype)
}

with the following commands:

$ go get github.com/gabriel-vasile/mimetype@v1.2.0
$ go run main.go 
application/octet-stream
$ go get github.com/gabriel-vasile/mimetype@v1.3.0
go get: upgraded github.com/gabriel-vasile/mimetype v1.2.0 => v1.3.0
$ go run main.go 
text/plain; charset=utf-8
$ go get github.com/gabriel-vasile/mimetype@v1.3.1
go get: upgraded github.com/gabriel-vasile/mimetype v1.3.0 => v1.3.1
$ go run main.go 
text/plain; charset=utf-8

Testing the same data with file:

$ hd test-file-from-shared_test_go_in_github-cli.txt 
00000000  ef bf bd ef bf bd ef bf  bd ef bf bd ef bf bd 10  |................|
00000010  4a 46 49 46 ef bf bd 01  01 01 01 2c 01 2c ef bf  |JFIF.......,.,..|
00000020  bd ef bf bd ef bf bd ef  bf bd ef bf bd 43 ef bf  |.............C..|
00000030  bd 08 06 06 07 06 05 08  07 07 07 09 09 08 0a 0c  |................|
00000040  14 0a 0c 0b 0b 0c 19 12  13 0f 14 1d 1a 1f 1e 1d  |................|
00000050  1a 1c 1c 20 24 2e 27 20  22 2c 23 1c 1c 28 37 29  |... $.' ",#..(7)|
00000060  2c 30 31 34 34 34 1f 27  39 3d 38 32 3c 2e 33 34  |,01444.'9=82<.34|
00000070  32 ef bf bd ef bf bd ef  bf bd 43 01 09 09 09 0c  |2.........C.....|
00000080
$ file test-file-from-shared_test_go_in_github-cli.txt 
test-file-from-shared_test_go_in_github-cli.txt: data

I believe this is a case where the test data happens to be valid UTF-8, but it "uses odd control characters, so doesn't look like text" according to https://github.com/file/file/blob/master/src/encoding.c; see https://github.com/file/file/blob/360436c8502150b0ef99d6ffd5683463804d863a/src/encoding.c#L350-L370

Thanks in advance!

@gabriel-vasile
Owner

Thank you for the detailed issue.
v1.2.0 uses the mimesniff definition of binary data bytes: 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F.
v1.3.0 uses a more permissive definition: a byte is binary only if it is the null byte 0x00.

Compared to file/file, the mimesniff definition treats three more control characters as binary: bell (0x07), backspace (0x08) and vertical tab (0x0B).

I guess the solution to this is to return to v1.2.0 behaviour.
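As a rough illustration of the difference between the two definitions, here is a small Go sketch. It is not the library's actual implementation; isBinaryNullOnly, isBinaryMimesniff and containsBinary are hypothetical names, and the byte ranges follow the "binary data byte" definition in the MIME Sniffing standard.

```go
package main

import "fmt"

// isBinaryNullOnly models the v1.3.0 rule described above:
// only the null byte counts as binary.
func isBinaryNullOnly(b byte) bool {
	return b == 0x00
}

// isBinaryMimesniff models the mimesniff "binary data byte" definition:
// 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F.
func isBinaryMimesniff(b byte) bool {
	return b <= 0x08 || b == 0x0B ||
		(b >= 0x0E && b <= 0x1A) ||
		(b >= 0x1C && b <= 0x1F)
}

// containsBinary reports whether any byte in data satisfies isBinary.
func containsBinary(data []byte, isBinary func(byte) bool) bool {
	for _, b := range data {
		if isBinary(b) {
			return true
		}
	}
	return false
}

func main() {
	// A few bytes from the reported test data: U+FFFD, 0x10, "JFIF", 0x01.
	data := []byte{0xef, 0xbf, 0xbd, 0x10, 'J', 'F', 'I', 'F', 0x01}

	fmt.Println(containsBinary(data, isBinaryNullOnly))  // false
	fmt.Println(containsBinary(data, isBinaryMimesniff)) // true
}
```

Under the null-only rule the reported data has no binary bytes, so it falls through to text detection; under the mimesniff rule the 0x10 and 0x01 bytes mark it as binary.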

@gabriel-vasile gabriel-vasile self-assigned this Sep 16, 2021
gabriel-vasile added a commit that referenced this issue Sep 16, 2021
This commit returns to the pre-v1.3.0 definition of binary data, taken from
the mimesniff standard. For #186.