feat: Add sync sharding #1891
Conversation
bd09fa9 to 70bc550 (Compare)
examples/simple_plugin/go.mod (outdated)
@@ -1,6 +1,8 @@
module github.com/cloudquery/plugin-sdk/examples/simple_plugin
These changes can be reverted once cloudquery/plugin-pb-go#401 is merged and released
@@ -40,27 +40,34 @@ func (s *syncClient) syncDfs(ctx context.Context, resolvedResources chan<- *sche
		s.metrics.initWithClients(table, clients)
	}

	var wg sync.WaitGroup
	tableClients := make([]tableClient, 0)
Historically the DFS scheduler didn't need to create table/client pairs, since we didn't do any sorting in the DFS scheduler. Because of the sharding support, we now need to collect the table/client pairs first, so we can shard them before the sync starts.
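To make that concrete, here is a minimal, self-contained sketch of the idea; the `tableClient` fields, the helper name, and the example table/client values are assumptions for illustration, not the SDK's actual code:

```go
package main

import "fmt"

// Hypothetical pair type; the real tableClient in the SDK may have different fields.
type tableClient struct {
	table  string
	client string
}

// Build the full, ordered list of table/client pairs before any syncing starts,
// so a shard can later be taken from a stable, deterministic ordering.
func collectTableClients(tables, clients []string) []tableClient {
	tableClients := make([]tableClient, 0, len(tables)*len(clients))
	for _, t := range tables {
		for _, c := range clients {
			tableClients = append(tableClients, tableClient{table: t, client: c})
		}
	}
	return tableClients
}

func main() {
	tcs := collectTableClients(
		[]string{"aws_ec2_instances", "aws_s3_buckets"},
		[]string{"account-1", "account-2"},
	)
	fmt.Println(len(tcs), "table/client pairs collected before sharding")
}
```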
@@ -45,6 +45,7 @@ func (s *syncClient) syncShuffle(ctx context.Context, resolvedResources chan<- *
	// so users have a little bit of control over the randomization.
	seed := hashTableNames(tableNames)
	shuffle(tableClients, seed)
	tableClients = shardTableClients(tableClients, s.shard)
I do see that shuffle is deterministic (at the moment), but I still think it's a bad idea to shard after shuffling. I'd move it before the shuffle.
OK, let me try switching the order and re-run the tests. We shuffle (this is the default in AWS) to avoid rate limits. I don't think sharding before shuffling will make a difference in that respect, but I'll re-test.
> I don't think sharding before shuffling will make a difference in that respect, but I'll re-test.

I think it will be fine since we round-robin before we shuffle anyway.
OK, I did a bit of testing and it looks good, so we can shard before shuffling.
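For reference, a minimal runnable sketch of the ordering agreed on here: take this shard's subset from the deterministically ordered list first, then shuffle only that subset. The round-robin assignment, the hard-coded seed input, and the shard numbers are assumptions for illustration; the SDK's shardTableClients, hashTableNames, and shuffle may work differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

func main() {
	// Deterministically ordered table/client pairs (as collected before sharding).
	tableClients := []string{"t1/acct-a", "t1/acct-b", "t2/acct-a", "t2/acct-b"}

	// 1. Shard: keep the pairs assigned to shard 1 of 2. Round-robin by index
	//    is an assumption, not necessarily the SDK's strategy.
	shardNum, shardTotal := 1, 2
	shard := make([]string, 0, len(tableClients))
	for i, tc := range tableClients {
		if i%shardTotal == shardNum-1 {
			shard = append(shard, tc)
		}
	}

	// 2. Shuffle only this shard's subset, seeded from the table names
	//    (mirroring hashTableNames + shuffle in the diff above).
	h := fnv.New64a()
	h.Write([]byte("t1t2"))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	r.Shuffle(len(shard), func(i, j int) { shard[i], shard[j] = shard[j], shard[i] })

	fmt.Println(shard)
}
```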
Is there any case where collecting the tables could be non-deterministic? Normally the tables are hardcoded in a plugin, so it should not be the case. If there is a plugin where the tables are dynamic, and they could change between syncs (e.g. if they are discovered by an API which is non-deterministic), sharding would not work.
In either case, I think the deterministic requirement is worth a one-liner comment.
> Is there any case where collecting the tables could be non-deterministic?
This is a good point, and definitely a limitation of this approach, see below ⬇️
- It can happen due to a bug in the plugin https://github.com/cloudquery/cloudquery-private/pull/4299.
- Plugins with dynamic tables don't use the scheduler; they do their own thing, so they would need to implement sharding on the plugin's side if needed (e.g. for the Postgres source it's probably better to use a stronger machine instead of sharding).
- I can think of other cases, e.g. someone creating an AWS account after shard 1/2 discovery and before shard 2/2 discovery. If we discover all accounts, that will mess up the sharding. A solution would be to hard code the accounts in the spec to avoid it.
Added a comment about the requirement
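To illustrate what such a requirement comment on the sharding helper could look like, here is a hedged sketch. The contiguous-range strategy, the (num, total) signature, and the tableClient fields are guesses for illustration only, not the SDK's actual implementation:

```go
package scheduler

// tableClient is a hypothetical table/client pair, as in the earlier sketch.
type tableClient struct {
	table  string
	client string
}

// shardTableClients returns the subset of tableClients that belongs to shard
// num out of total. It assumes tableClients is built deterministically:
// every shard must see the same tables and clients in the same order,
// otherwise shards can overlap or miss table/client pairs entirely.
func shardTableClients(tableClients []tableClient, num, total int) []tableClient {
	if total <= 1 || len(tableClients) == 0 {
		return tableClients
	}
	// One possible strategy (assumption): give each shard a contiguous range.
	size := (len(tableClients) + total - 1) / total
	start := (num - 1) * size
	if start >= len(tableClients) {
		return nil
	}
	end := start + size
	if end > len(tableClients) {
		end = len(tableClients)
	}
	return tableClients[start:end]
}
```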
bfecb65 to 58d9c71 (Compare)
🤖 I have created a release *beep* *boop*

## [4.63.0](v4.62.0...v4.63.0) (2024-09-18)

### Features

* Add sync sharding ([#1891](#1891)) ([e1823f8](e1823f8))

### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-pb-go to v1.22.3 ([#1895](#1895)) ([b05d24b](b05d24b))
* **deps:** Update module google.golang.org/grpc to v1.66.2 ([#1893](#1893)) ([6d70b88](6d70b88))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
#### Summary

Follow up to cloudquery/plugin-sdk#1891
#### Summary

We'll need to release a few plugins with cloudquery/plugin-sdk#1891 first, hence the future date in the command description
Summary

Goes with cloudquery/plugin-pb-go#401.

Still testing this, so it's in draft. Part of https://github.com/cloudquery/cloudquery-issues/issues/2214 (internal issue)
Use the following steps to ensure your PR is ready to be reviewed:
- Run `go fmt` to format your code 🖊
- Run `golangci-lint run` 🚨 (install golangci-lint here)