Support loading datasets of 10GB in BQ in less than 5 min #429
Comments
Below are the results of profiling the load_file operator.

tottime: the total time spent in the given function, excluding time made in calls to sub-functions.
cumtime: the cumulative time spent in this function and all subfunctions, from invocation until exit. This figure is accurate even for recursive functions.

On further investigation, pd.read_csv() was found to take 49.5 sec:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1       0.000    0.000    49.588   49.588   readers.py:491(read_csv)

Solution
We can leverage multithreading and create a pool of threads that read the file in parts, process the rows, and ingest the data into the upstream database; a rough sketch follows below. As this approach is overkill for smaller files, we can be selective about the file size above which it is applied.

Assumptions and limitations
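A minimal sketch of the chunked, multithreaded read described under Solution above. The ingest_chunk() helper, chunk size, and worker count are hypothetical illustrations, not part of the original proposal; here the main thread reads chunks via pandas' chunksize option and a thread pool ingests them in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

CHUNK_SIZE = 100_000   # rows per chunk; illustrative value
MAX_WORKERS = 4        # illustrative thread-pool size


def ingest_chunk(chunk: pd.DataFrame) -> None:
    """Hypothetical helper: write one dataframe chunk to the upstream database."""
    # e.g. a bulk insert or chunk.to_sql(...) against the target database
    ...


def load_csv_in_parallel(path: str) -> None:
    # chunksize makes read_csv return an iterator of dataframes instead of
    # loading the whole file into memory at once
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [
            pool.submit(ingest_chunk, chunk)
            for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE)
        ]
        for future in futures:
            future.result()  # surface any ingestion errors
```

The file-size threshold mentioned above would simply gate whether load_csv_in_parallel() or the existing single-threaded path is used.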
Results
Standalone Python script
Waiting for Kaxil's review before we merge.
https://cloud.google.com/bigquery-transfer/docs/s3-transfer - BigQuery might be able to support loading files from S3 directly; a rough sketch of what that could look like is below.
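If the Data Transfer Service route were pursued, a transfer configuration for the amazon_s3 data source might look roughly like the sketch below. The project, dataset, table, bucket path, and credential values are placeholders, and the parameter names are assumptions based on the linked docs that would need to be verified.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# All identifiers below are placeholders for illustration only.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="s3-to-bigquery-transfer",
    data_source_id="amazon_s3",
    params={
        "destination_table_name_template": "my_table",
        "data_path": "s3://my-bucket/path/*.csv",
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "file_format": "CSV",
    },
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-gcp-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```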
What is the current behavior?
When we run the load_file operator for a table in BigQuery and a file in GCS, the data is first downloaded from GCS to the worker's local storage and then uploaded to BigQuery. Because of this, we require both memory and network bandwidth on the worker node.
partially closes: #429
What is the new behavior?
We use an optimized path that ingests data directly from GCS into BigQuery (roughly as sketched below), removing both the memory and bandwidth requirements on the worker node. As a result, transfers are faster and can handle larger volumes of data.
Does this introduce a breaking change?
No
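Conceptually, the optimized path relies on BigQuery pulling the file straight from GCS. A minimal sketch with the google-cloud-bigquery client is below; the table ID, bucket URI, and job configuration are placeholders, and the operator's actual settings may differ.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders; the real values come from the operator's parameters.
table_id = "my-project.my_dataset.my_table"
gcs_uri = "gs://my-bucket/path/to/file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)

# BigQuery reads the file directly from GCS; nothing is downloaded
# to the worker node.
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
```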
What is the current behavior?
Currently, we transfer the data from local storage to BigQuery by first creating a dataframe and then using pandas.to_sql() to ingest it into BigQuery.
closes: #534
related: #429
What is the new behavior?
We intend to use Google's Python SDK to ingest the data directly (see the sketch below), saving the time that is currently spent converting files into dataframes.
Does this introduce a breaking change?
Nope
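A rough sketch of what direct ingestion of a local file with the BigQuery client could look like, skipping the dataframe step entirely; the table ID, file path, and job configuration are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"   # placeholder
local_path = "/tmp/data.csv"                  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Stream the raw file straight to BigQuery instead of building a
# pandas dataframe and calling to_sql().
with open(local_path, "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, table_id, job_config=job_config
    )
load_job.result()  # wait for completion
```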
Description
What is the current behavior?
Benchmarking results for S3 to BigQuery are missing.
related: #429
What is the new behavior?
Added benchmarking results for S3 to BigQuery.
Does this introduce a breaking change?
No
Dependencies
Acceptance criteria