feat: Add sync sharding #1891
Conversation
bd09fa9 to 70bc550 (Compare)
examples/simple_plugin/go.mod (outdated)
@@ -1,6 +1,8 @@
module github.com/cloudquery/plugin-sdk/examples/simple_plugin
These changes can be reverted once cloudquery/plugin-pb-go#401 is merged and released
@@ -40,27 +40,34 @@ func (s *syncClient) syncDfs(ctx context.Context, resolvedResources chan<- *sche
		s.metrics.initWithClients(table, clients)
	}

	var wg sync.WaitGroup
	tableClients := make([]tableClient, 0)
Historically the DFS scheduler didn't need to create table/client pairs, since we didn't do any sorting in the DFS scheduler. Because of the sharding support, we now need to collect the table/client pairs first, so we can shard them before the sync starts.
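To make that concrete, here is a minimal, self-contained sketch of the idea; the `tableClient` fields, the helper name, and the example table/client values are assumptions for illustration, not the SDK's actual code:

```go
package main

import "fmt"

// Hypothetical pair type; the real tableClient in the SDK may have different fields.
type tableClient struct {
	table  string
	client string
}

// Build the full, ordered list of table/client pairs before any syncing starts,
// so a shard can later be taken from a stable, deterministic ordering.
func collectTableClients(tables, clients []string) []tableClient {
	tableClients := make([]tableClient, 0, len(tables)*len(clients))
	for _, t := range tables {
		for _, c := range clients {
			tableClients = append(tableClients, tableClient{table: t, client: c})
		}
	}
	return tableClients
}

func main() {
	tcs := collectTableClients(
		[]string{"aws_ec2_instances", "aws_s3_buckets"},
		[]string{"account-1", "account-2"},
	)
	fmt.Println(len(tcs), "table/client pairs collected before sharding")
}
```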
@@ -45,6 +45,7 @@ func (s *syncClient) syncShuffle(ctx context.Context, resolvedResources chan<- *
	// so users have a little bit of control over the randomization.
	seed := hashTableNames(tableNames)
	shuffle(tableClients, seed)
	tableClients = shardTableClients(tableClients, s.shard)
I do see that shuffle is deterministic (at the moment), but I still think it's a bad idea to shard after shuffling. I'd move it before the shuffle.
OK, let me try switching the order and re-run the tests. We shuffle (this is the default in AWS) to avoid rate limits. I don't think sharding before shuffling will make a difference in that respect, but I'll re-test.
> I don't think sharding before shuffling will make a difference in that respect, but I'll re-test.

I think it will be fine since we round-robin before we shuffle anyway.
OK, I did a bit of testing and it looks good, so we can shard before shuffling.
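For reference, a minimal runnable sketch of the ordering agreed on here: take this shard's subset from the deterministically ordered list first, then shuffle only that subset. The round-robin assignment, the hard-coded seed input, and the shard numbers are assumptions for illustration; the SDK's shardTableClients, hashTableNames, and shuffle may work differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

func main() {
	// Deterministically ordered table/client pairs (as collected before sharding).
	tableClients := []string{"t1/acct-a", "t1/acct-b", "t2/acct-a", "t2/acct-b"}

	// 1. Shard: keep the pairs assigned to shard 1 of 2. Round-robin by index
	//    is an assumption, not necessarily the SDK's strategy.
	shardNum, shardTotal := 1, 2
	shard := make([]string, 0, len(tableClients))
	for i, tc := range tableClients {
		if i%shardTotal == shardNum-1 {
			shard = append(shard, tc)
		}
	}

	// 2. Shuffle only this shard's subset, seeded from the table names
	//    (mirroring hashTableNames + shuffle in the diff above).
	h := fnv.New64a()
	h.Write([]byte("t1t2"))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	r.Shuffle(len(shard), func(i, j int) { shard[i], shard[j] = shard[j], shard[i] })

	fmt.Println(shard)
}
```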
Is there any case where collecting the tables could be non-deterministic? Normally the tables are hardcoded in a plugin, so it should not be the case. If there is a plugin where the tables are dynamic, and they could change between syncs (e.g. if they are discovered by an API which is non-deterministic), sharding would not work.
In either case, I think the deterministic requirement is worth a one-liner comment.
> Is there any case where collecting the tables could be non-deterministic?
This is a good point, and definitely a limitation of this approach, see below ⬇️
- It can happen due to a bug in the plugin https://github.com/cloudquery/cloudquery-private/pull/4299.
- Plugins with dynamic tables don't use the scheduler; they do their own thing, so they would need to implement sharding on the plugin's side if needed (e.g. for the Postgres source it's probably better to use a stronger machine instead of sharding).
- I can think of other cases, e.g. someone creating an AWS account after shard 1/2 discovery and before shard 2/2 discovery. If we discover all accounts, that will mess up the sharding. A solution would be to hard code the accounts in the spec to avoid it.
Added a comment about the requirement
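To illustrate what such a requirement comment on the sharding helper could look like, here is a hedged sketch. The contiguous-range strategy, the (num, total) signature, and the tableClient fields are guesses for illustration only, not the SDK's actual implementation:

```go
package scheduler

// tableClient is a hypothetical table/client pair, as in the earlier sketch.
type tableClient struct {
	table  string
	client string
}

// shardTableClients returns the subset of tableClients that belongs to shard
// num out of total. It assumes tableClients is built deterministically:
// every shard must see the same tables and clients in the same order,
// otherwise shards can overlap or miss table/client pairs entirely.
func shardTableClients(tableClients []tableClient, num, total int) []tableClient {
	if total <= 1 || len(tableClients) == 0 {
		return tableClients
	}
	// One possible strategy (assumption): give each shard a contiguous range.
	size := (len(tableClients) + total - 1) / total
	start := (num - 1) * size
	if start >= len(tableClients) {
		return nil
	}
	end := start + size
	if end > len(tableClients) {
		end = len(tableClients)
	}
	return tableClients[start:end]
}
```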
bfecb65 to 58d9c71 (Compare)
🤖 I have created a release *beep* *boop*

## [4.63.0](v4.62.0...v4.63.0) (2024-09-18)

### Features

* Add sync sharding ([#1891](#1891)) ([e1823f8](e1823f8))

### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-pb-go to v1.22.3 ([#1895](#1895)) ([b05d24b](b05d24b))
* **deps:** Update module google.golang.org/grpc to v1.66.2 ([#1893](#1893)) ([6d70b88](6d70b88))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
#### Summary

Follow up to cloudquery/plugin-sdk#1891
#### Summary

We'll need to release a few plugins with cloudquery/plugin-sdk#1891 first, hence the future date in the command description
Summary

Goes with cloudquery/plugin-pb-go#401.

Still testing this, so it's in draft. Part of https://github.com/cloudquery/cloudquery-issues/issues/2214 (internal issue)
Use the following steps to ensure your PR is ready to be reviewed:
- Run `go fmt` to format your code 🖊
- Run `golangci-lint run` 🚨 (install golangci-lint here)