Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bigquery table.insert fails for identical rows #1041

Closed
j-5-s opened this issue Jan 2, 2016 · 31 comments
Closed

bigquery table.insert fails for identical rows #1041

j-5-s opened this issue Jan 2, 2016 · 31 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API.

Comments

@j-5-s
Copy link

j-5-s commented Jan 2, 2016

It seems to be failing sometimes when you insert identical rows in parallel. I think this is because the insertId is being generated from a stringified md5 hash of the row. Would you be open to a pull request to pass in your own insert id in an optional metadata object? It could fallback to how it operates now.

https://github.com/GoogleCloudPlatform/gcloud-node/blob/fbf1ecc36c45c48aeee5469e9729a529d94ee610/lib/bigquery/table.js#L904

@stephenplusplus
Copy link
Contributor

Thanks for offering! I think you just want to skip the insertId then, since its purpose is to prevent duplicate data from being inserted. If that sounds right, we could just add an options object like:

var opts = {
  allowDuplicates: true // (default: false)
};

table.insert(rows, [opts], callback);

When it's true, we just wouldn't set the insertId.

@jgeewax @fhoffa is this a sane plan?

@stephenplusplus stephenplusplus added the api: bigquery Issues related to the BigQuery API. label Jan 2, 2016
@stephenplusplus
Copy link
Contributor

Sent a PR: #1045

@stephenplusplus
Copy link
Contributor

@JamesCharlesworth while trying to test this out for #1045, I can't actually reproduce the error. Can you show me some code I can use that should cause the error?

@palamccc
Copy link

palamccc commented Jan 5, 2016

https://github.com/GoogleCloudPlatform/gcloud-node/blob/c7b49cb8fec15d6d159db6d7b8832e8ac4e50d0e/lib/bigquery/table.js#L902-L910
As of now md5(JSON.stringify(row)) is used as insertId, This is not a robust one. JSON.Stringify is not guaranteed to give same result for similar objects. and md5 can be same for two different strings.

I think the insertId generation should be left to user. If user is copying data from another db, then user may prefer to use existing db id as insertId. Just like mentioned in Google API Docs the insertId should be a 'opt-in' feature. the insertId deduplication is also not guaranteed, so generating insertId automatically may lead to inconsistent behaviour.

@stephenplusplus
Copy link
Contributor

I think the insertId generation should be left to user.

👍 Would you (or @JamesCharlesworth) be open for sending a PR that realizes your vision for how that should be implemented?

@palamccc
Copy link

I looked into code and tried to remove insertId feature. But it will break existing applications.

Instead we can add new api 'insertAll' and make it transparent. User should build and pass the request body, as per spec. In this way, user can use 'insertId' ,'templateSuffix', 'ignoreUnknownValues', 'skipInvalidRows' features. In current implementation of 'insert', user can't use any of these features.

I prefer, gcloud-node should be just utility library to call api endpoints. Its should not add any magic to the call. I just checked python api, its very clean, and just matches with api spec.

@stephenplusplus
Copy link
Contributor

I prefer, gcloud-node should be just utility library to call api endpoints

That's more like https://github.com/google/google-api-nodejs-client -- which is the Node version of the Python library you linked to. We purposely design an easier to use API around the raw, upstream JSON API.

And that comes with challenges. It's not always clear where our API needs to allow 100% upstream transparency and when lesser coverage is okay. Each call is different than the next. That's why we chose to default insertId in the first place-- when sensible, we want each method to be completely "ready-to-go" for the common use case.

I still agree insertId can have its quirks, so it's probably best to avoid defaulting it to a value. The questions I think we have are:

  1. Do we need to allow insertId, templateSuffix, and ignoreUnknownValues to be set at all? The way our library is now is the way its always been-- defaulting insertId and not allowing the other properties to be set at all. This issue was the first to bring it up, and it can be solved by just removing insertId.
  2. If we do allow setting these properties, how can we do it?
    2a. A new insertAll method
    2b. ??

If we can find a good answer to 2, I'm all for supporting it. I would prefer we blend support into the original insert method somehow, e.g.

var row = {
  json: {/*...*/},
  insertId: '...'
};

table.insert([row, row], { raw: true }, function() {});

@callmehiphop thoughts?

@palamccc
Copy link

I was not aware of https://github.com/google/google-api-nodejs-client , Probably you can link to the library in README.md, the current repo name https://github.com/GoogleCloudPlatform/gcloud-node looks like official library. somehow i landed here from google search results, and assumed that this is official library. Thanks.

@stephenplusplus
Copy link
Contributor

In fact, we do!

screen shot 2016-01-10 at 2 08 59 pm

Sorry for the confusion.

@palamccc
Copy link

thanks, I missed it. If possible, add to this page also. https://cloud.google.com/nodejs/

@stephenplusplus
Copy link
Contributor

Related issue: #556. I'm not sure who runs the docs pages, but I'll tag @jgeewax to make sure someone sees your feedback.

@callmehiphop
Copy link
Contributor

@stephenplusplus IMO introducing options for this method seems like the way to go. I like the raw option. Maybe we could also include an option for generating the insertIds when raw is not flagged.

@callmehiphop
Copy link
Contributor

Was this resolved by #1068?

@stephenplusplus
Copy link
Contributor

Sure was!

@j-5-s
Copy link
Author

j-5-s commented Jan 26, 2016

Awesome, thank you guys. really appreciated this!

@stephenplusplus
Copy link
Contributor

Re-opening based on feedback from @jgeewax here

Can we re-open the discussion about using a UUID-1 (or similar) to generate insert ID values when none are provided (and which are only used when we automatically retry a failed request) ?

@jgeewax I think it was this comment that left me thinking it was best to stop defaulting something that the API docs said wasn't a good idea:

Just like mentioned in Google API Docs the insertId should be a 'opt-in' feature. the insertId deduplication is also not guaranteed, so generating insertId automatically may lead to inconsistent behaviour. (source)


  • Insert rows r
  • Request times out (maybe network cable was yanked, maybe GCP had an error)
  • It's impossible to know for sure whether this batch of rows was added
  • Retry inserting rows r

I believe you're suggesting that all of this is behind the scenes, as opposed to the user initiating the retry. To review our insert methods real quick:

  • table.insert(r): Makes an all-in-one API request. If the request fails, I don't think the API saves any of the data, as the network transmission wouldn't have been properly completed. Retrying is done automatically if there is a failure.
  • table.createWriteStream(). This won't retry at all. Last I tried, retrying writable streams is impossible.

If the underlying goal is safe-retry-able insert operations, would resumable uploads be a better solution? We currently don't support it, but we could (please re-open that issue if we need that). However, this wouldn't work for table.createWriteStream() as mentioned above, and only with insert.

@stephenplusplus
Copy link
Contributor

stephenplusplus commented May 7, 2017

This part of my last post isn't right:

However, this wouldn't work for table.createWriteStream() as mentioned above, and only with insert.

I forgot I made the GCS resumable uploads API work with with file.createWriteStream(), which uses gcs-resumable-upload under the hood. The reason we don't already do this in BigQuery is because I figured using the GCS upload -> BigQuery import was a safer solution when you're doing larger data inserts. I should be able to figure this out with the BigQuery insert API as well, if we decide we want to support this.

@jgeewax
Copy link
Contributor

jgeewax commented May 7, 2017

I'm focusing specifically on the table.insert(data) (no interest at all with the streaming inserts).

Imagine you have an app and you want to log events for later analysis to BQ (e.g. at the end of each Uber trip, they might make a request to BQ saying "this trip happened: "). If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

I think the auto-generation that we had (hashing the data) was misplaced in that it could convey to people that BQ is capable of de-duplicating data on the way in when it really is not.

The insert ID is not for de-duplicating data itself, but instead for de-duplicating requests. In other words, it's there to make sure that requests which returned an ambiguous result (e.g., the data may have made it into BQ, but you can't be sure) can be safely retried without causing duplicate data to be added.

Using a UUID-4 (random values) means that these types of errors (e.g., "Something might have gone wrong, but it's also possible that everything is fine...") can be corrected with a retry ("Just try it again with the same randomly generated value, no problem at all, worst case is we disregard the request because we already got it").

This also means that client-side errors (e.g., the process crashed before we know if the data was added) are not solved at all -- users are still responsible for that, which I think is the right thing to do.

@stephenplusplus
Copy link
Contributor

If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

But would this happen when we're only making one HTTP request? I believe the request goes through with transfer-encoding: chunk. Is the upstream API parsing each bit it gets, inserting the rows as it notices they are defined in the bits, or does it wait until all data is received before storing any data? I have been assuming the API sees an incomplete transmission as a failure, so it wouldn't save any of it. So our retrying would be safe, because none of the last attempts could have been stored.

@stephenplusplus
Copy link
Contributor

If something happens that would cause an auto-retry (e.g., network cable popped out and the initial try got a timeout), our retry logic, if it doesn't have an insert ID, may add these same rows twice (this happens if the network cable gets yanked after the request has made it to BQ).

I think my answer is in here, but I'm trying to figure it out:

  1. "my-app" calls table.insert()
  2. Network cable pops out, callback not executed to communicate results of test
  3. my-app goes "dang, network dropped out, I need to retry [step 1]"
  4. data saves twice

Is that what we're trying to fix?

@jgeewax
Copy link
Contributor

jgeewax commented May 8, 2017

Yes. In that scenario, if we don't have an insert ID, the retry is not safe and may end up with data saving twice.

It's important to make clear that we're not trying to use insert ID as data de-duplication but as request de-duplication. That is, this should be a per-request ID not a "per-data" ID. Does that make sense?

@stephenplusplus
Copy link
Contributor

Alright, so we are solving for:

  • We send a request
  • BigQuery inserts the data
  • Network drops out between ending the POST transmission and receiving the response (this is what I've been stuck on. This isn't a streaming request, it's an all-in-one. Wouldn't this be a very, very narrow slice of time?)
  • Network wakes up
  • Our retry logic only got a network hangup/timeout response, so it makes the request again

@jgeewax
Copy link
Contributor

jgeewax commented May 10, 2017

This isn't a streaming request, it's an all-in-one.

Correct.

Wouldn't this be a very, very narrow slice of time?

Yes. Imagine sending 1k QPS and yanking the cable. Or any other network blip. At least some of those requests are going to be in this situation.

@stephenplusplus
Copy link
Contributor

Okay, thanks for walking me through it. I'll assign a uuid to non-insertId'd rows of data.

@stephenplusplus stephenplusplus self-assigned this May 10, 2017
@jgeewax
Copy link
Contributor

jgeewax commented May 10, 2017 via email

@stephenplusplus
Copy link
Contributor

Know anyone?

@stephenplusplus
Copy link
Contributor

@swcloud do you know anyone from the BigQuery team who can help review this suggestion? Thanks!

@danoscarmike
Copy link
Contributor

@callmehiphop might this be addressed by the bigquery work you're currently looking at?

@callmehiphop
Copy link
Contributor

@danoscarmike AFAIK there is not a requirement that covers this specific issue, however if we could get some feedback from a BigQuery team member I'd be happy to sneak it in.

@danoscarmike
Copy link
Contributor

@jba could you run this by the best POC on BigQuery? Thanks!

@jba
Copy link

jba commented Aug 9, 2017

@danoscarmike There's so much going on in this issue that I'm not sure what "this" is. But I can tell you:

  • BQ insert IDs are per-row.
  • The Go client lets the user provide their own in the "standard" case.
  • There are also cases (like uploading Go values directly, which for convenience we allow and convert into rows) where there is no surface for allowing the user to provide an insert ID, and we do not attempt to construct one. This can result in duplicate rows. We document this clearly.
  • In their recent set of requests, BQ team did not mention anything about client handling of insert IDs.

If that still leaves open questions for BQ team, let me know what they are and I'll relay.

sofisl pushed a commit that referenced this issue Jan 17, 2023
fix: use google-gax v3.3.0
Source-Link: googleapis/synthtool@c73d112
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-nodejs:latest@sha256:b15a6f06cc06dcffa11e1bebdf1a74b6775a134aac24a0f86f51ddf728eb373e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API.
Projects
None yet
Development

No branches or pull requests

8 participants