
Cryptic Error when blob.download_to_file from within Google Cloud Dataflow #3836

Closed
bw4sz opened this issue Aug 17, 2017 · 11 comments

Labels
api: storage Issues related to the Cloud Storage API. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

bw4sz commented Aug 17, 2017

I am running a Cloud Dataflow pipeline which requires that each worker download some videos from a GCP bucket, process them, and reupload them. The pipeline works locally, but when deployed to Dataflow I get a cryptic error when using google-cloud-storage to download blobs.

with open(local_path, 'wb') as file_obj:
      blob.download_to_file(file_obj)

returns:

File "run_clouddataflow.py", line 48, in process
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 464, in download_to_file
    self._do_download(transport, file_obj, download_url, headers)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 418, in _do_download
    download.consume(transport)
  File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 101, in consume
    self._write_to_stream(result)
  File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 62, in _write_to_stream
    with response:
AttributeError: __exit__ [while running 'Run DeepMeerkat']
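For context on what the traceback means: `with response:` requires the response object to implement the context-manager protocol (`__enter__`/`__exit__`), which `requests.Response` only gained in requests 2.18.0. On an older requests, the `with` statement itself blows up before any bytes are written. A stdlib-only sketch of the failure mode (the `OldResponse` class is hypothetical, standing in for a pre-2.18 `Response`):

```python
class OldResponse:
    # stands in for a pre-2.18 requests.Response:
    # it has no __enter__/__exit__ methods
    pass

try:
    with OldResponse():
        pass
except (AttributeError, TypeError) as exc:
    # Python 2 raises AttributeError: __exit__ (exactly as in the
    # traceback above); recent Python 3 releases raise TypeError instead
    error_name = type(exc).__name__
```

So the error is neither a permissions nor a latency problem; it only signals which requests version the worker happened to have installed.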

The function in question is:

  def process(self, element):

    import csv
    import logging
    import os
    from urlparse import urlparse

    import google.auth
    from google.cloud import storage
    from DeepMeerkat import DeepMeerkat

    DM = DeepMeerkat.DeepMeerkat()

    logging.info(os.getcwd())
    logging.info(element)

    # try adding credentials? set credentials, inherited from the worker
    credentials, project = google.auth.default()

    # parse the gs:// path of the element
    parsed = urlparse(element[0])
    logging.info(parsed)

    storage_client = storage.Client(credentials=credentials)
    bucket = storage_client.get_bucket(parsed.hostname)
    blob = storage.Blob(parsed.path[1:], bucket)

    # download the element locally
    local_path = parsed.path.split("/")[-1]
    logging.info('local path: ' + local_path)
    with open(local_path, 'wb') as file_obj:
      blob.download_to_file(file_obj)

    logging.info("Downloaded " + local_path)

    # assign input from Dataflow/manifest
    DM.process_args(video=local_path)
    DM.args.output = "Frames"

    # run DeepMeerkat
    DM.run()

Mostly I'm trying to understand what this error means. Is this a permissions error? A latency error? A local write error? I cannot recreate it outside of Dataflow. What I can tell from my logging is that the Dataflow worker is trying to write "video.avi" to root (i.e. "/"). I've tried writing to /tmp/, but without knowing whether this is a permissions error, either locally or on the GCP bucket, I'm having trouble debugging. Can you give me a sense of what kind of error this stems from?
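As an aside on the write-to-root symptom: `parsed.path.split("/")[-1]` yields a bare filename, so the worker writes to whatever its current working directory happens to be. Deriving the local path explicitly under /tmp avoids that. A stdlib-only sketch (the `gs://` URL below is made up; on Python 2 the import is `from urlparse import urlparse`):

```python
import os
from urllib.parse import urlparse

element = "gs://my-bucket/Hummingbirds/FH109_01.AVI"  # hypothetical element
parsed = urlparse(element)

bucket_name = parsed.netloc           # the bucket: "my-bucket"
blob_name = parsed.path.lstrip("/")   # the object name inside the bucket
# join the bare filename onto /tmp so the write never lands in "/"
local_path = os.path.join("/tmp", os.path.basename(parsed.path))
```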

The SO question is here.

Dataflow workers require a setup.py, so all packages come from a current pip install google-cloud.

I've tried explicitly passing credentials using google-auth, but it doesn't seem to make a difference.

dhermes (Contributor) commented Aug 17, 2017

@bw4sz It's a package version error. We've "resolved" this by releasing google-cloud-storage==1.3.2, which requires requests>=2.18.0. (See #3814 and #3736.)

Let me know if upgrading requests to 2.18.0+ and google-cloud-storage to 1.3.2 does not fix this, then we can re-open and investigate.
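For later readers: pinning both packages in the worker environment (e.g. in the setup.py that Dataflow requires, or a requirements file) picks up this fix. A minimal requirements fragment using the minimum versions named above:

```
requests>=2.18.0
google-cloud-storage>=1.3.2
```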

@dhermes dhermes closed this as completed Aug 17, 2017
@dhermes dhermes added api: storage Issues related to the Cloud Storage API. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 17, 2017
dhermes (Contributor) commented Aug 17, 2017

@bw4sz Thanks a lot for taking the time to file. This (plus the related issues that led to #3814) has revealed how big an issue it was to allow "any old version" there.

bw4sz (Author) commented Aug 18, 2017

@dhermes That works thanks.

Just as a favor, would you comment on the following.

I'm running a Cloud Dataflow task that pulls videos from my GCP bucket, processes them locally, and reuploads them. Would you expect any difference in speed between gsutil and google-cloud-storage from within Python? My videos are 3 to 4 GB, and while it's hard to track exactly, I was expecting the transfer from GCP to Dataflow workers to be very fast. It doesn't appear so. Any guess on what speed might be reasonable to expect?

dhermes (Contributor) commented Aug 18, 2017

@bw4sz Yes. gsutil is highly optimized; google-cloud-storage has had zero benchmarking at any point in its lifetime. I would recommend using gsutil if the system-call overhead doesn't drag your application performance down.
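If the subprocess route is acceptable, shelling out to gsutil from a worker is a thin wrapper. A minimal sketch (the helper name is ours, not an API; the `run` argument is injectable so the command can be inspected without gsutil installed):

```python
import subprocess

def download_with_gsutil(gs_url, local_path, run=subprocess.check_call):
    # builds and executes `gsutil cp gs://... /local/path`;
    # `run` defaults to subprocess.check_call but can be swapped out in tests
    cmd = ["gsutil", "cp", gs_url, local_path]
    run(cmd)
    return cmd
```

When copying many files at once, adding the top-level -m flag (gsutil -m cp ...) parallelizes the transfers.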

bw4sz (Author) commented Aug 18, 2017

Thanks, just to follow up: blob.download_to_file is definitely having trouble with very large files. Testing locally on OSX and monitoring du -h in the destination folder, I can see that the 3 GB file just hangs out at 365 MB. Is there a cap (should I submit a new issue)?

My file is the .avi; checking du -h 10 minutes apart shows no movement, and the Python call is still hanging. Perhaps this isn't the right use case?

Bens-MBP:tmp ben$ du -h
  0B	./com.apple.launchd.IEP9Z62Yv7
  0B	./com.apple.launchd.uerhxnqCTy
365M	.
Bens-MBP:tmp ben$ du -h
  0B	./com.apple.launchd.IEP9Z62Yv7
  0B	./com.apple.launchd.uerhxnqCTy
365M	.
Bens-MBP:tmp ben$ ls
FH109_01.AVI			com.apple.launchd.uerhxnqCTy
com.apple.launchd.IEP9Z62Yv7

dhermes (Contributor) commented Aug 18, 2017

I'm a bit thrown off by the du -h output. Why is it showing . instead of the filename?

When you say it hangs, do you mean the process freezes up?

bw4sz (Author) commented Aug 18, 2017

I was just calling du in a silly way; here:

Bens-MBP:tmp ben$ du -sh *
365M	FH109_01.AVI
  0B	com.apple.launchd.IEP9Z62Yv7
  0B	com.apple.launchd.uerhxnqCTy

I have two terminals open: one calling the Apache Beam pipeline (testing locally on DirectRunner) and one watching the destination of the video file. I can see the video file get created, and it grows until that 365M size, then just hangs out. The process is still "running", but it's clearly stuck on blob.download_to_file. I have a logging statement before and after:

    logging.info('local path: ' + local_path)
    with open(local_path, 'wb') as file_obj:
      blob.download_to_file(file_obj)
    
    logging.info("Check local path exists: " + str(os.path.exists(local_path)))

and stdout is just chilling there:

INFO:root:/Users/ben/Documents/DeepMeerkat
INFO:root:['gs://api-project-773889352370-testing/Hummingbirds/FH109_01.AVI']
INFO:root:ParseResult(scheme='gs', netloc='api-project-773889352370-testing', path='/Hummingbirds/FH109_01.AVI', params='', query='', fragment='')
INFO:root:local path: /tmp/FH109_01.AVI

It never makes it to the next logging statement.

Everything works great with test clips (~100 MB).
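One thing worth trying for files this size (an assumption on our part, not something verified in this thread): constructing the blob with an explicit chunk_size, e.g. storage.Blob(name, bucket, chunk_size=5 * 1024 * 1024), makes the client fetch the object in separate ranged requests instead of one long streaming response; the value must be a multiple of 256 KiB. The ranged-read idea itself can be sketched with no dependencies:

```python
def iter_byte_ranges(total_size, chunk_size):
    # yields inclusive (start, end) byte ranges, in the order a chunked
    # download would request them via HTTP Range headers
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)
        start = end + 1

# e.g. a 10-byte object fetched 4 bytes at a time
ranges = list(iter_byte_ranges(10, 4))
```

Each range is bounded, so a stall surfaces as a failed chunk rather than a silently hung stream.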

bw4sz (Author) commented Aug 18, 2018

Forgive me for pestering, but I'm sure there will be others interested in this. How does it compare to apache-beam's gcsio?

This SO question would suggest we are supposed to use gcsio, but it's not yet clear to me whether that's just between buckets, or down to the local worker.

kparaju commented Mar 7, 2018

I had to uninstall requests 2.18.4 (pip uninstall requests) and install 2.18.0 (pip install requests==2.18.0) to make it work.

TotangoRam commented

Hi, I'm running requests 2.18.0 and google-cloud-storage==1.7.0 and get the same error. What versions do I need now?

RupertMa commented Aug 4, 2018

Hi, I'm running requests 2.19.1 and google-cloud-storage==1.10.0 and get the same error. What versions do I need now?

5 participants