
upload-product is not efficient #196

Closed
fredwangwang opened this issue Jun 28, 2018 · 22 comments

@fredwangwang
Member

Noticed this yesterday: when I ran upload-product --product pas.pivotal, om spent a long time saying processing product. Eventually I found that it writes the entire multipart payload to a tempfile on disk. This makes sense because we never want to load a 9G file into memory, but it is not efficient. And after the upload finishes, it leaves the payload in the /tmp folder. Although the operating system will eventually clean it up, it occupies a lot of space in the meantime.

So I looked for a way to multipart-upload a big file without consuming much disk or memory, but failed to find a good library for it. So I tried writing one, with the aim of plugging into om as easily as possible.

Then I come up with this:
https://github.com/fredwangwang/formcontent/

It loads the content of the file from the disk on demand, so it doesn't eat additional disk or consume much memory.

There are two versions in the repository: a threaded one (master branch) and a non-threaded one (non-threaded branch). The non-threaded branch works better, but the code clarity is worse, as it basically implements a cheap state machine in a custom Read method.

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@zachgersh
Contributor

zachgersh commented Jun 28, 2018 via email

@fredwangwang
Member Author

It can be as generic as a standalone library, I guess? I didn't find a good enough library, so I just wanted to write one.

@mcwumbly
Contributor

mcwumbly commented Jul 4, 2018

@ljfranklin curious for your thoughts on this one since I know you've spent time in the past looking at improvements to upload speeds

@ljfranklin
Contributor

ljfranklin commented Jul 11, 2018

Definitely something we'd like to investigate further. At minimum we shouldn't be leaving 9GB temp files after exit. I haven't looked at our code closely but I wonder if it's related to this.

Extra disk space aside, are you seeing any speed up in upload time with your library?

@fredwangwang
Member Author

but I wonder if it's related to this.

That is interesting, but I believe it is a separate issue. The type Form struct there is used by the reader, i.e. the server side, while this issue is specifically about uploading a multipart file on the client side.

any speed up in upload time with your library?

Given these assumptions: a 9G file and a rotational disk (write speed ~150MB/s), it takes about one minute just to write the tempfile to disk. And all that time shows as processing product.

The library eliminates that minute.

@ljfranklin
Contributor

@fredwangwang could you run time om upload-product with and without your change? Sorry if I'm coming across as overly skeptical but I'm always wary of performance optimizations without seeing the numbers. For example, is the tempfile actually 9GB or is it a smaller tempfile that's used as a temporary buffer?

@fredwangwang
Member Author

@ljfranklin

could you run time om upload-product with and without your change

There is no point timing it... When uploading a 9G file, network bandwidth fluctuation would be the primary factor causing any time difference. The library shortens the total time by around one minute given the exact same network conditions, that's all.

For a smaller file (1G for example), writing to disk doesn't take that long anyway, so the (noticeable, mainly time-wise) difference from the library is not obvious.

For example, is the tempfile actually 9GB or is it a smaller tempfile that's used as a temporary buffer?

If you dig into the source code, it writes the entire multipart payload into a tempfile, which includes file_to_upload + form_field + form_boundary_sep. That is even larger than the file to be uploaded, by around 1KB (so essentially the size of the file itself).


If you really want to experience the improvement of the library without touching the source code: go to an upload-product task and watch it run. Time how long it spends on processing product. Imagine that time becomes 100ms. That's the optimization.

@ljfranklin
Contributor

@fredwangwang confirmed that your implementation does speed things up:

Current om code (22 seconds to process w/SSD):

om upload-product -p ~/Downloads/srt-2.2.0-build.104.pivotal | ts
Jul 13 08:56:06 processing product
Jul 13 08:56:28 beginning product upload to Ops Manager
 56.42 MiB / 9.44 GiB [>-----------------------------------------]   0.58% 9m44s

Your code (<1 second processing time):

go run main.go upload-product -p ~/Downloads/srt-2.2.0-build.104.pivotal | ts
Jul 13 08:55:52 processing product
Jul 13 08:55:52 beginning product upload to Ops Manager
 70.07 MiB / 9.44 GiB [>-----------------------------------------]   0.72% 9m33s

For your implementation, I'm still hoping we can have the stdlib do the heavy lifting and not have to deal with the low level boundary calculations. Could we instead use an approach similar to this to have an io.Pipe take the place of the tempfile in the original implementation? I haven't gotten a chance to try it out, but I'm thinking we could have AddFile and AddField just record their args in a []Parts, then Finalize() spins out a go routine to write all the parts into the pipe. Still might have to do a bit of math to calculate the content length, but still feels like it would simplify things.

@fredwangwang
Member Author

@ljfranklin

I have that here: https://github.com/fredwangwang/formcontent/tree/threaded

That was my first attempt (using io.Pipe). See my original post; that was actually my master branch before. But I wanted it to be single-threaded (for fun...hh). And the single-threaded version actually has a smaller memory footprint (about 0.5MB smaller, as I would guess io.Pipe uses 512 bytes to hold the buffer).

So you could take a look at that; the code does look simpler.

@ljfranklin
Contributor

@fredwangwang Let's cross-team pair on getting this merged in. I'm liking the direction the threaded branch is going in, but I don't understand why we have a multipart.Writer just for fields and additional multipart.Writers for each file. I was expecting a single multipart.Writer which gets passed the writing end of the io.Pipe. Maybe AddFile just records its args:

func (f *Form) AddField(key string, value string) error {
  f.parts = append(f.parts, part{
    Key: key,
    Value: value,
    Type: "field",
  })
  return nil
}
func (f *Form) AddFile(key string, path string) error {
  f.parts = append(f.parts, part{
    Key: key,
    Value: path,
    Type: "file",
  })
  return nil
}

Then in Finalize() we can loop over all the parts in a go routine and write each part into the multipart.Writer. This keeps the boundary math to a minimum, although we still might need a bit of it to calculate the final content length.

@fredwangwang
Member Author

@ljfranklin yeah I do think your solution is cleaner. The reason I did it that way is that I didn't want to create additional structs to hold all the different properties, so the form fields can be stored in a single form.formFields and each file has an entry in files and filekeys ("pointlessful" justification...)

@ljfranklin
Contributor

Moving convo from slack back here:

Do you want a one writer approach rather than the multi temp writer approach I have now?

I prefer the one writer approach just because it's easier for me to wrap my head around and matches most closely with the canonical multipart upload examples listed in the godocs. My main goal is to minimize the amount of low-level boundary calculations we have to do to keep the code as maintainable as possible, e.g. if separate && f.formFields.Len() > len(f.boundary)+8.

@ljfranklin
Contributor

Paired for a bit with @fredwangwang and sounds like we're in sync about the potential implementation details. He's planning on submitting what he has as a PR and we'll try to get it merged in.

@ljfranklin
Contributor

Merged #211 to fix this. Thanks for working through this tricky issue @fredwangwang!

@mcwumbly
Contributor

I tried the following and don't see an obvious improvement. Is my test case valid?

New version (92ea225) (1m 59s):

dm $ time om -k upload-product --product ~/Downloads/p-isolation-segment-2.3.0-build.182.pivotal
processing product
beginning product upload to Ops Manager
 1.90 GiB / 1.90 GiB [============================================] 100.00% 1m3s
54s elapsed, waiting for response from Ops Manager...
finished upload

real	1m59.051s
user	0m15.974s
sys	0m24.836s

Old version (0.38.0) (1m 47s):

dm $ time om -k upload-product --product ~/Downloads/p-isolation-segment-2.3.0-build.182.pivotal
processing product
beginning product upload to Ops Manager
 1.90 GiB / 1.90 GiB [============================================] 100.00% 1m3s
38s elapsed, waiting for response from Ops Manager...
finished upload

real	1m47.667s
user	0m11.840s
sys	0m23.214s

oh... looks like I should just be paying attention to the time spent in "processing product"?

@fredwangwang
Member Author

Yep, the improvement would not speed up the network connection (I wish it could...); it's about reducing the processing product time and the temp disk space occupied by the command.

@mcwumbly
Contributor

Get ts:

dm $ brew install moreutils

How big is this thing? (1.9G)

dm $ ls -lh ~/Downloads/p-isolation-segment-2.3.0-build.182.pivotal
-rw-r--r--@ 1 pivotal  staff   1.9G Jul 25 11:32 /Users/pivotal/Downloads/p-isolation-segment-2.3.0-build.182.pivotal

Before (8s)

dm $ om -k upload-product --product ~/Downloads/p-isolation-segment-2.3.0-build.182.pivotal | ts
Jul 25 11:49:04 processing product
Jul 25 11:49:12 beginning product upload to Ops Manager
...

After (0s)

dm $ om -k upload-product --product ~/Downloads/p-isolation-segment-2.3.0-build.182.pivotal | ts
Jul 25 11:51:38 processing product
Jul 25 11:51:38 beginning product upload to Ops Manager
...

@fredwangwang
Member Author

fredwangwang commented Jul 25, 2018 via email

fredwangwang added a commit that referenced this issue Aug 20, 2018
related to issue #196, I created the formcontent in my repo, now there is a
request to move the code into a pivotal repo to remove the dependency on a
personal repo.

[#159835480]
@fredwangwang
Member Author

Had a branch to move formcontent into om. Could you review and merge it back to master?

@fredwangwang fredwangwang reopened this Aug 20, 2018
@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

fredwangwang added a commit that referenced this issue Oct 15, 2018
related to issue #196, I created the formcontent in my repo, now there is a
request to move the code into a pivotal repo to remove the dependency on a
personal repo.

[#159835480]
@fredwangwang
Member Author

merged
