
Add line iterator to StreamingBody #1195

Merged · 5 commits · Jun 27, 2018

Conversation

@sujaymansingh (Contributor)

This way, we can read lines from a streaming body (without having to load
all the bytes into memory!)

This will allow us to use StreamingBody in places where a Python
file-like object is expected.
(E.g. csv.reader!)

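The intended use with csv.reader can be sketched like this (a minimal illustration: io.BytesIO stands in for a StreamingBody, and the decode step is wiring for the example, not part of the PR itself):

```python
import csv
import io

# Stand-in for a StreamingBody: any object whose iterator yields byte lines.
body = io.BytesIO(b"name,qty\napples,3\npears,5\n")

# Decode each byte line so csv.reader can consume it like a text file object.
lines = (line.decode("utf-8") for line in body)
rows = list(csv.reader(lines))
print(rows)  # [['name', 'qty'], ['apples', '3'], ['pears', '5']]
```

The point of the PR is that iterating the body yields one line at a time, so only a chunk's worth of data is ever held in memory.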
@codecov-io

codecov-io commented May 15, 2017

Codecov Report

Merging #1195 into develop will increase coverage by <.01%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           develop    #1195      +/-   ##
===========================================
+ Coverage    98.01%   98.02%   +<.01%     
===========================================
  Files           45       45              
  Lines         7306     7328      +22     
===========================================
+ Hits          7161     7183      +22     
  Misses         145      145
Impacted Files Coverage Δ
botocore/response.py 92.06% <100%> (+4.25%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9cca4a...6866ab7. Read the comment docs.

  • This was not dealing well when the chunk ended on a newline.
  • There is no need for a StopIteration in a generator; we just don't yield anything more!
  • In particular, catch the case where a chunk ends on a line boundary. We can do this by ensuring that no matter what the chunk size is, we still end up with the same lines.
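The behaviour these commits describe can be sketched as a standalone generator (an illustrative re-implementation, not the PR's actual code): buffer any partial line across chunk boundaries so the output is identical for every chunk size, and simply stop yielding instead of raising StopIteration.

```python
def iter_lines(chunks):
    """Yield complete lines from an iterable of byte chunks."""
    pending = b''
    for chunk in chunks:
        lines = (pending + chunk).splitlines(True)  # keep line endings
        # The last element may be an incomplete line (including the case
        # where the previous chunk ended exactly on a newline): hold it
        # back unless it already ends with a newline.
        if lines and not lines[-1].endswith(b'\n'):
            pending = lines.pop()
        else:
            pending = b''
        for line in lines:
            yield line.rstrip(b'\n')
    if pending:
        yield pending  # final line without a trailing newline

data = b'alpha\nbeta\ngamma'
# Splitting the stream at any chunk size yields the same lines:
one = list(iter_lines([data]))
two = list(iter_lines([data[i:i + 2] for i in range(0, len(data), 2)]))
assert one == two == [b'alpha', b'beta', b'gamma']
```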
    default_chunk_size = 1024
    return self.iter_lines(default_chunk_size)

    def iter_lines(self, chunk_size):
Contributor

I think I would prefer to have this as a private method, normal file objects do not have an iter_lines method.

@sujaymansingh (Contributor Author) commented Jun 2, 2017

Thanks @dstufft. I can change it to _iter_lines. Is that ok with you?

Contributor

Yea that seems fine to me.

Contributor Author

@dstufft, I've made the change in @6866ab7
Hope it helps!

@dstufft (Contributor)

dstufft commented May 24, 2017

Overall this looks good to me, just one small comment.

@mbeacom

mbeacom commented Jun 15, 2017

Is this mergeable with the requested changes in? I think it'd be a great addition until the IOBase/RawIOBase possibilities are explored and tested further.

@mcrowson

Looks like the comment was addressed? There is a group of us really excited for this PR 💯

@mattrasband

mattrasband commented Sep 7, 2017

@dstufft Would it be best to use a bytearray here instead of creating a bunch of new bytes objects? For larger files (as common in s3) it seems like this could cause a bit of resource thrashing.

edit: From http://ze.phyr.us/bytearray/ here is a timeit and writeup on the mutable vs immutable sequences used for buffers:

In [1]: %%timeit x = b''
   ...: x += b'x'
100000 loops, best of 3: 3.02 µs per loop

In [2]: %%timeit x = bytearray()
   ...: x.extend(b'x')
10000000 loops, best of 3: 152 ns per loop
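The buffering pattern this benchmark motivates can be sketched as follows (illustrative only; the buffer names are hypothetical): append incoming chunks to a mutable bytearray rather than rebuilding an immutable bytes object on every iteration.

```python
chunks = [b'x' * 64] * 1000  # simulated stream chunks

# Immutable bytes: each += builds a brand-new object, copying everything
# accumulated so far (quadratic behaviour in the worst case).
buf_bytes = b''
for c in chunks:
    buf_bytes += c

# Mutable bytearray: extend() grows the same buffer in place
# (amortized O(1) per appended byte).
buf_array = bytearray()
for c in chunks:
    buf_array.extend(c)

assert bytes(buf_array) == buf_bytes  # identical contents either way
```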

@stealthycoin added the "enhancement" (This issue requests an improvement to a current feature.) and "medium" labels Sep 14, 2017
@sujaymansingh (Contributor Author)

@mattrasband @dstufft Is the bytearray performance the only concern?

@mattrasband

@sujaymansingh I don't have any authority to approve/merge - I stumbled on the issue as I wrote my own workaround a while ago and wanted to contribute it back (you already had). Eagerly awaiting this myself.

@ranman

ranman commented Feb 28, 2018

This would be super cool to have for all streaming response types

@sburns

sburns commented May 1, 2018

What will it take to get this merged?

@W-Ely

W-Ely commented May 4, 2018

Do it!

@ranman

ranman commented May 4, 2018

I would still love to see this!

@iolairus

iolairus commented Jun 5, 2018

Looking forward to this one, +1

@seddy

seddy commented Jun 5, 2018

Any chance on getting this merged? It would be really useful for us!

@smferguson

+1. it'd be great to have this in!

@billcrook

+1

@sujaymansingh (Contributor Author)

@dstufft @mbeacom What should we do about this? At this point it's been over a year since this PR was raised. Presumably there have been billions of changes upstream.

I could rebase or branch off again, but it's only worth doing that if

  • there is still a need/desire for this functionality from people
  • it's something that is likely to be merged (it'd be terrible to have to wait another year :))

Would you say it's worth doing so? If so, I can create a new branch and submit.
Otherwise, it's probably worth abandoning this PR.

@sburns

sburns commented Jun 18, 2018

there is still a need/desire for this functionality from people

I hope the continued +1 comments are evidence of this.

@sujaymansingh (Contributor Author)

Good point @sburns !

@jamesls (Member)

jamesls commented Jun 25, 2018

Hi everyone, sorry for the delay on this. I'm taking a look at it now; I'll assign it to myself and make sure it gets merged.

@jamesls self-assigned this Jun 25, 2018
    default_chunk_size = 1024
    return self._iter_lines(default_chunk_size)

    def _iter_lines(self, chunk_size):
Member

What are people's thoughts on exposing this method directly as iter_lines() instead of having the default behavior of __iter__ be line based?

This would be similar to what requests exposes via its iter_lines method. Looking at what they provide for __iter__, it appears to be chunks of 128 bytes. That's closer to how I'd expect a streaming response object to behave, though I'd bump up the chunks to maybe 1k or 4k.


+1 to the default iterator being larger chunks (4k, but that's just me) but also being able to get lines explicitly via iter_lines().
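The split being discussed here could look roughly like the following toy class (hypothetical names, not botocore's code): __iter__ delegates to fixed-size chunk iteration, while iter_lines() is the explicit opt-in for line-based iteration.

```python
import io


class FakeStreamingBody:
    """Toy illustration of the proposed interface (not botocore's code)."""

    DEFAULT_CHUNK_SIZE = 1024  # the "1k" suggested above

    def __init__(self, raw):
        self._raw = raw  # any file-like object with .read()

    def __iter__(self):
        # Default iteration yields fixed-size byte chunks.
        return self.iter_chunks(self.DEFAULT_CHUNK_SIZE)

    def iter_chunks(self, chunk_size):
        while True:
            chunk = self._raw.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def iter_lines(self, chunk_size=DEFAULT_CHUNK_SIZE):
        # Line-based view: buffer partial lines across chunk boundaries.
        pending = b''
        for chunk in self.iter_chunks(chunk_size):
            lines = (pending + chunk).split(b'\n')
            pending = lines.pop()  # last piece may be incomplete
            for line in lines:
                yield line
        if pending:
            yield pending


body = FakeStreamingBody(io.BytesIO(b'a\nbb\nccc'))
assert list(body.iter_lines(chunk_size=2)) == [b'a', b'bb', b'ccc']
```

Whatever the defaults end up being, the key design point is that chunk iteration and line iteration are separate, explicitly named entry points.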

@jamesls (Member)

jamesls commented Jun 25, 2018

Given the age of the PR and that this was against your develop branch, I was hesitant to push to your fork so I pulled in your PR and incorporated some of the feedback I had. PR here: #1491

@sujaymansingh (Contributor Author)

That's a great idea @jamesls! Thanks 👍

@jamesls merged commit 6866ab7 into boto:develop Jun 27, 2018
@jamesls (Member)

jamesls commented Jun 27, 2018

Merged via #1491

@sujaymansingh (Contributor Author)

Thanks @jamesls

Labels: enhancement (This issue requests an improvement to a current feature.), incorporating-feedback