-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bigquery table.insert fails for identical rows #1041
Comments
Thanks for offering! I think you just want to skip the insertId then, since its purpose is to prevent duplicate data from being inserted. If that sounds right, we could just add an options object like: var opts = {
allowDuplicates: true // (default: false)
};
table.insert(rows, [opts], callback); When it's |
Sent a PR: #1045 |
@JamesCharlesworth while trying to test this out for #1045, I can't actually reproduce the error. Can you show me some code I can use that should cause the error? |
https://github.com/GoogleCloudPlatform/gcloud-node/blob/c7b49cb8fec15d6d159db6d7b8832e8ac4e50d0e/lib/bigquery/table.js#L902-L910 I think the insertId generation should be left to user. If user is copying data from another db, then user may prefer to use existing db id as insertId. Just like mentioned in Google API Docs the insertId should be a 'opt-in' feature. the insertId deduplication is also not guaranteed, so generating insertId automatically may lead to inconsistent behaviour. |
👍 Would you (or @JamesCharlesworth) be open for sending a PR that realizes your vision for how that should be implemented? |
I looked into code and tried to remove insertId feature. But it will break existing applications. Instead we can add new api 'insertAll' and make it transparent. User should build and pass the request body, as per spec. In this way, user can use 'insertId' ,'templateSuffix', 'ignoreUnknownValues', 'skipInvalidRows' features. In current implementation of 'insert', user can't use any of these features. I prefer, gcloud-node should be just utility library to call api endpoints. Its should not add any magic to the call. I just checked python api, its very clean, and just matches with api spec. |
That's more like https://github.com/google/google-api-nodejs-client -- which is the Node version of the Python library you linked to. We purposely design an easier to use API around the raw, upstream JSON API. And that comes with challenges. It's not always clear where our API needs to allow 100% upstream transparency and when lesser coverage is okay. Each call is different than the next. That's why we chose to default I still agree
If we can find a good answer to 2, I'm all for supporting it. I would prefer we blend support into the original var row = {
json: {/*...*/},
insertId: '...'
};
table.insert([row, row], { raw: true }, function() {}); @callmehiphop thoughts? |
I was not aware of https://github.com/google/google-api-nodejs-client , Probably you can link to the library in README.md, the current repo name https://github.com/GoogleCloudPlatform/gcloud-node looks like official library. somehow i landed here from google search results, and assumed that this is official library. Thanks. |
thanks, I missed it. If possible, add to this page also. https://cloud.google.com/nodejs/ |
@stephenplusplus IMO introducing options for this method seems like the way to go. I like the raw option. Maybe we could also include an option for generating the insertIds when raw is not flagged. |
Was this resolved by #1068? |
Sure was! |
Awesome, thank you guys. really appreciated this! |
Re-opening based on feedback from @jgeewax here
@jgeewax I think it was this comment that left me thinking it was best to stop defaulting something that the API docs said wasn't a good idea:
I believe you're suggesting that all of this is behind the scenes, as opposed to the user initiating the retry. To review our insert methods real quick:
If the underlying goal is safe-retry-able insert operations, would resumable uploads be a better solution? We currently don't support it, but we could (please re-open that issue if we need that). However, this wouldn't work for |
This part of my last post isn't right:
I forgot I made the GCS resumable uploads API work with with |
I'm focusing specifically on the Imagine you have an app and you want to log events for later analysis to BQ (e.g. at the end of each Uber trip, they might make a request to BQ saying "this trip happened: "). If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ). I think the auto-generation that we had (hashing the data) was misplaced in that it could convey to people that BQ is capable of de-duplicating data on the way in when it really is not. The insert ID is not for de-duplicating data itself, but instead for de-duplicating requests. In other words, it's there to make sure that requests which returned an ambiguous result (e.g., the data may have made it into BQ, but you can't be sure) can be safely retried without causing duplicate data to be added. Using a UUID-4 (random values) means that these types of errors (e.g., "Something might have gone wrong, but it's also possible that everything is fine...") can be corrected with a retry ("Just try it again with the same randomly generated value, no problem at all, worst case is we disregard the request because we already got it"). This also means that client-side errors (e.g., the process crashed before we know if the data was added) are not solved at all -- users are still responsible for that, which I think is the right thing to do. |
But would this happen when we're only making one HTTP request? I believe the request goes through with transfer-encoding: chunk. Is the upstream API parsing each bit it gets, inserting the rows as it notices they are defined in the bits, or does it wait until all data is received before storing any data? I have been assuming the API sees an incomplete transmission as a failure, so it wouldn't save any of it. So our retrying would be safe, because none of the last attempts could have been stored. |
I think my answer is in here, but I'm trying to figure it out:
Is that what we're trying to fix? |
Yes. In that scenario, if we don't have an insert ID, the retry is not safe and may end up with data saving twice. It's important to make clear that we're not trying to use insert ID as data de-duplication but as request de-duplication. That is, this should be a per-request ID not a "per-data" ID. Does that make sense? |
Alright, so we are solving for:
|
Correct.
Yes. Imagine sending 1k QPS and yanking the cable. Or any other network blip. At least some of those requests are going to be in this situation. |
Okay, thanks for walking me through it. I'll assign a uuid to non-insertId'd rows of data. |
Can we run this by someone on the BQ team to be sure this is right ?
|
Know anyone? |
@swcloud do you know anyone from the BigQuery team who can help review this suggestion? Thanks! |
@callmehiphop might this be addressed by the bigquery work you're currently looking at? |
@danoscarmike AFAIK there is not a requirement that covers this specific issue, however if we could get some feedback from a BigQuery team member I'd be happy to sneak it in. |
@jba could you run this by the best POC on BigQuery? Thanks! |
@danoscarmike There's so much going on in this issue that I'm not sure what "this" is. But I can tell you:
If that still leaves open questions for BQ team, let me know what they are and I'll relay. |
fix: use google-gax v3.3.0 Source-Link: googleapis/synthtool@c73d112 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-nodejs:latest@sha256:b15a6f06cc06dcffa11e1bebdf1a74b6775a134aac24a0f86f51ddf728eb373e
It seems to be failing sometimes when you insert identical rows in parallel. I think this is because the insertId is being generated from a stringified md5 hash of the row. Would you be open to a pull request to pass in your own insert id in an optional metadata object? It could fallback to how it operates now.
https://github.com/GoogleCloudPlatform/gcloud-node/blob/fbf1ecc36c45c48aeee5469e9729a529d94ee610/lib/bigquery/table.js#L904
The text was updated successfully, but these errors were encountered: