fetching a binary file from http & sending it => corruption #1375

Closed
emmanueltouzery opened this issue Mar 27, 2020 · 5 comments

emmanueltouzery commented Mar 27, 2020

Environment

  • k6 version: 0.26.1
  • OS and version: linux, fedora 31
  • Docker version and image, if applicable: -

Expected Behavior

I fetch a binary file over HTTP in setup(), then print the size of the binary both in setup() and in the VU function. In my real program, of course, I want to send the binary over HTTP.

Actual Behavior

I would expect the length of the binary that I fetched to be the same in the setup and in the VU, but it's not. The binary is garbled:

INFO[0000] whew.png body size: 3399
INFO[0000] body size is: 3302

The first line is from the setup, the second from the VU sender.

The discrepancy is quite a lot worse if I don't use base64 encode/decode (in that case the size in the VU sender is about twice as large).

Steps to Reproduce the Problem

I have this test code:

import http from "k6/http";
import { sleep } from "k6";
import encoding from "k6/encoding";

function getFile(path) {
    // replace this IP with the IP of your machine
    const body = http.get(`http://192.168.178.76:8000/${path}`, {
        responseType: "binary"
    }).body;
    console.log(`${path} body size: ${body.length}`);
    return encoding.b64encode(body);
}

export function setup() {
    // put the name of a binary file in the current folder
    return getFile(`whew.png`);
}

export default function(data) {
    console.log("body size is: " + encoding.b64decode(data).length);
    sleep(25);
}

I used python to serve files from the local folder:

python2 -m SimpleHTTPServer 8000 

As an aside, the reason I'm fetching the files from HTTP is that I have lots of files to send, different for each VU. If I fetch the files in the per-VU init code (as I think I'm meant to do), I don't have the __VU there (I'm getting __VU is not defined). So I'd have to fetch the files for all the VUs in each VU init code, which would be way too much: I have about 250Mb of data for all VUs together, and 2000 VUs -- if each VU did fetch all the data, I'd load 250Mb*2000. So what I tried to do is to load the data for all the VUs together just once, in the setup. But now I'm hitting this issue.

EDIT If I make http read+write in the VU sending function (not using the setup) then it works:
INFO[0000] whew.png body size: 3399
INFO[0000] body size is: 3399

@imiric
Contributor

imiric commented Mar 27, 2020

Hi,

the difference you're seeing here is because of a type difference: body in this case is a raw binary array whereas b64decode() returns a string, so their .length will be different, even though the data is the same. You can confirm this by console.log-ing the base64 string and decoding it manually (e.g. base64 --decode < enc.b64 > dec.png) and you'll see the image is not corrupted.

This behavior is part of the discrepancies in how k6 handles binary data. See issue #1020 for details. Ideally both body and the value returned by b64decode() would be of the same type and actually be usable, and there wouldn't be a difference in their .length values.

But to address your use case, even with this binary issue aside, currently you wouldn't be able to achieve the memory savings you expect by loading all data in setup() once, since that data is passed to each VU, so if you load and return 250Mb from setup() you'd still need 250Mb*$K6_VUS amount of memory during the test.

One workaround you can consider is manually splitting the data for each VU, as suggested here. Since you're not dealing with JSON, you would need to request only images for each specific VU, but that pattern would probably work for you.
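A minimal sketch of the per-VU split described above, written in plain JavaScript so it runs outside k6. The helper name `filesForVU` and the file list are hypothetical and not part of k6. In a k6 script the `vu` argument would presumably come from `__VU` inside the default function, and each VU would then request only its own files.

```javascript
// Hypothetical helper (not a k6 API): returns the slice of file names
// assigned to one VU. `vu` is 1-based, like k6's __VU in the default function.
function filesForVU(allFiles, vu, totalVUs) {
  // Number of files each VU handles, rounded up so no file is left out.
  const perVU = Math.ceil(allFiles.length / totalVUs);
  const start = (vu - 1) * perVU;
  return allFiles.slice(start, start + perVU);
}

// Example data (made up for illustration).
const exampleFiles = ["a.png", "b.png", "c.png", "d.png", "e.png"];

console.log(filesForVU(exampleFiles, 1, 2)); // files for VU 1
console.log(filesForVU(exampleFiles, 2, 2)); // files for VU 2
```

With this split, every VU holds only its own share of the data instead of a full copy of everything.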

Note that sharing setup data efficiently across VUs has already been discussed and planned (see #532), and with the upcoming #1007/#997 distributed execution changes this kind of setup will be easier and more efficient.

I'll close this as these are known issues, but let us know if you have additional questions, and for further support you're welcome to use the community forum.

@imiric imiric closed this as completed Mar 27, 2020
@emmanueltouzery
Author

emmanueltouzery commented Mar 27, 2020

No, you misread: the length difference is not because of base64. I do decode, and I'm pretty sure there is in fact a bug. Thank you for the tips -- now I'm preparing my data by fetching it in the VU loop (I could also fetch it just in the first iteration). But the bug I described does stand.

@emmanueltouzery
Author

[image: corr]
here is an example of the corruption

@na-- na-- reopened this Mar 27, 2020
@na-- na-- added the "evaluation needed" label Mar 27, 2020
@mstoykov
Contributor

After ... too much digging (and reading your screenshot backwards, which cost some time), the problem in the particular case of b64encode->b64decode giving different data is the combination of b64decode returning a string and how goja (the JS VM k6 uses) works with strings ...

The short of it is that b64decode returns a string instead of a []byte (even though b64encode takes a []byte). If it returned a []byte it would have worked, but unfortunately changing that is (probably) a breaking change.

A possible workaround that I found is to JSON.stringify(data) and then JSON.parse(stringified). This works for the first three lines of bytes in the screenshot above, don't know if it works for all ;).
Again in your case this will NOT save you any amount of memory because the setup data is copied to all VUs ... actually, because you will need to decode it, it will use even more memory :).
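A rough illustration of that workaround in plain JavaScript (run in Node, not in k6). It assumes the binary body is available as an array of byte values, and the sample bytes are chosen only for illustration. How k6 and goja represent the body internally may differ, so treat this as a sketch of the idea rather than a verified k6 script.

```javascript
// Sample bytes, including two non-ASCII values (0x8a, 0x81).
const sampleBytes = [0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x08, 0x08,
                     0x08, 0x00, 0x8a, 0x81, 0x7a, 0x50, 0x00, 0x00];

// What setup() would return: a JSON string listing the byte values.
const stringified = JSON.stringify(sampleBytes);

// What the VU function would do with the setup data.
const restored = JSON.parse(stringified);

console.log(stringified);
console.log(restored.length === sampleBytes.length);
```

Because every byte travels as a plain number, no string encoding step gets a chance to alter it. The cost is size: each byte takes up to four characters of JSON.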

Longer explanation :D (I have probably gotten something wrong, but it seems like my conclusions agree with the experiments I have done)

JS uses UTF-16 (kind of 😑) for its strings. k6 is written in Go, which uses UTF-8 for its strings.

My knowledge on the matter is limited, but the important fact is that the byte representation of a character in one doesn't match the other. So the goja VM translates strings that are not ASCII-only (fun fact ... UTF-8 ASCII is not the same as UTF-16 ASCII so no idea why :D) from k6's internal UTF-8 to UTF-16 when k6 returns a string to the JS VM, and does the reverse when a string from goja goes back to k6 (a little bit more complicated but ... close enough).

Now if we look at this code in the golang playground:

package main

import (
	"fmt"
	"reflect"
	"unicode/utf16"

	"github.com/davecgh/go-spew/spew"
)

func main() {
	b := []byte{
		0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x08, 0x08, 0x08, 0x00, 0x8a, 0x81, 0x7a, 0x50, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x2a, 0x00, 0x00, 0x00, 0x54, 0x52,
		0x41, 0x4e, 0x5f, 0x56, 0x41, 0x4c, 0x5f, 0x42, 0x41, 0x4b, 0x55, 0x56, 0x41, 0x4c, 0x30, 0x30,
	}

	s := string(b)
	utf16s := utf16.Encode([]rune(s))
	utf8s := utf16.Decode(utf16s)

	fmt.Println(reflect.DeepEqual([]byte(s), b))
	fmt.Println(s == string(utf8s))
	spew.Dump([]byte(s))
	spew.Dump([]byte(string(utf8s)))
}
// output:
true
false
([]uint8) (len=48 cap=48) {
 00000000  50 4b 03 04 14 00 08 08  08 00 8a 81 7a 50 00 00  |PK..........zP..|
 00000010  00 00 00 00 00 00 00 00  00 00 2a 00 00 00 54 52  |..........*...TR|
 00000020  41 4e 5f 56 41 4c 5f 42  41 4b 55 56 41 4c 30 30  |AN_VAL_BAKUVAL00|
}
([]uint8) (len=52 cap=64) {
 00000000  50 4b 03 04 14 00 08 08  08 00 ef bf bd ef bf bd  |PK..............|
 00000010  7a 50 00 00 00 00 00 00  00 00 00 00 00 00 2a 00  |zP............*.|
 00000020  00 00 54 52 41 4e 5f 56  41 4c 5f 42 41 4b 55 56  |..TRAN_VAL_BAKUV|
 00000030  41 4c 30 30                                       |AL00|
}

Program exited.

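we can see that going from UTF-8 to UTF-16 and back isn't lossless in this case. In the second dump, the bytes 8a and 81 have each been replaced by ef bf bd, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character. My gut feeling is that some of those bytes are not actually valid UTF-8, and string([]byte{...}) doesn't do any checks or fixes, it just copies the data bytes blindly, whereas the conversion to runes and UTF-16 replaces what it can't decode :)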
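The same kind of loss can be reproduced in plain JavaScript, as a rough analogue of the Go program above. This sketch uses the standard `TextDecoder` and `TextEncoder` globals (available in Node and browsers). It is not the code path goja or k6 actually take, only an illustration of what happens when arbitrary bytes are treated as UTF-8 text.

```javascript
// Six bytes from around the damaged spot in the dump above.
const original = Uint8Array.from([0x08, 0x00, 0x8a, 0x81, 0x7a, 0x50]);

// Decode the bytes as UTF-8 into a JS (UTF-16) string.
// Bytes that are not valid UTF-8 are replaced with U+FFFD.
const asString = new TextDecoder("utf-8").decode(original);

// Encode the string back to UTF-8 bytes.
const roundTripped = new TextEncoder().encode(asString);

// Small formatting helper for printing bytes as hex.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(original));     // 08 00 8a 81 7a 50
console.log(hex(roundTripped)); // 08 00 ef bf bd ef bf bd 7a 50
```

Each invalid byte grows from one byte to three, which matches the 48 to 52 byte growth in the Go output: two bytes replaced, four bytes gained.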

I would argue the exact reason is not important as this is clearly not how binary data should be handled in k6 and this should just be fixed by using typed arrays for []byte and so on.

Additionally, the k6 b64decode returns string, which definitely makes the whole thing look more and more like utf16.Encode skips/tries to fix what it doesn't understand from the supposedly UTF-8 encoded string that b64decode returns.

I would argue b64decode should have either always returned []byte or should have had a mode to do that, but I am not certain it can now be worked around, and I would argue this should happen after #1020 (or as part of it :D).

Another issue found along the way is that, because the data returned from setup is encoded as JSON using Go's json package, []byte arrays get base64-encoded ... but are not decoded when we put the data back in each VU, as they are strings by then. The goja JSON implementation does the correct thing ™️ and marshals it to an array of ints, which is why the workaround above works.
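To make the two JSON shapes concrete, here is a small Node sketch. It imitates the base64 form by hand using Node's `Buffer` (Go's encoding/json serializes a []byte as a base64 string); it does not call into k6 or goja, and the four sample bytes are only for illustration.

```javascript
// Four sample bytes.
const zipHeader = Uint8Array.from([0x50, 0x4b, 0x03, 0x04]);

// Shape 1: a base64 string, imitating how Go's encoding/json writes a []byte.
const asBase64Json = JSON.stringify(Buffer.from(zipHeader).toString("base64"));

// Shape 2: an array of integers.
const asArrayJson = JSON.stringify(Array.from(zipHeader));

// Parsing shape 1 yields a string, not bytes, unless someone decodes it.
const fromBase64 = JSON.parse(asBase64Json);
// Parsing shape 2 yields the byte values directly.
const fromArray = JSON.parse(asArrayJson);

console.log(asBase64Json, typeof fromBase64, fromBase64.length);
console.log(asArrayJson, Array.isArray(fromArray), fromArray.length);

// An explicit decode step recovers the original bytes from shape 1.
const decoded = Buffer.from(fromBase64, "base64");
console.log(decoded.length);
```

The base64 string has 8 characters for 4 bytes, so its `.length` does not match the byte count until it is decoded.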

This should be fixed, but unfortunately, the internal goja JSON implementation is not exported ... there is Object.MarshalJSON, but in the code we have a ... goja.Value so I'm pretty sure it will take more than two lines :(

@na-- na-- removed the "evaluation needed" label Mar 30, 2020
@mstoykov
Contributor

I think that this should be fixed with a4927b6#diff-787f834ad3403248052890ea97f946bffc88d39d2821b3157b22451081c7c393, so I am closing it
