Introduce a filter hook mechanism for engine.schema_cache #131
-
20 minutes is still quite a lot. SnowDDL should generally be faster than other tools due to built-in parallelism. Could you run SnowDDL with …? Did you try to increase …?

Can you potentially use a separate owner role to stop SnowDDL from reading objects from other databases? SnowDDL should skip all databases which are not owned by the current role. That includes everything inside those databases: schemas and schema objects. Code: https://github.com/littleK0i/SnowDDL/blob/master/snowddl/cache/schema_cache.py#L30-L31

With …

The main problem with filtering is that objects are tightly connected to each other. For example, schema roles are required for grants, but in order to create schema roles we need schemas. If some schemas are skipped, roles and grants should be skipped as well. It would be relatively hard to implement a filtering feature reliably outside of the most basic use cases.

Also, people will start to rely on the idea of "applying config partially", which naturally leads to situations where parts of the config are incomplete or broken, and it is not obvious.

Let's see if we can figure out something else. Btw, I might be available for consulting starting from ~21 Oct. Converting everything automatically and cross-validating with a separate test Snowflake account might be easier than doing a slow and painful step-by-step migration for months.
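To illustrate the owner-role skip described above, here is a minimal sketch (not SnowDDL's actual implementation; the function name and row shape are assumptions for illustration): given the rows returned by `SHOW DATABASES`, which include an `owner` column, keep only the databases owned by the role SnowDDL runs as.

```python
# Illustrative sketch, NOT SnowDDL's actual code: skip every database
# that is not owned by the current role, so its schemas and schema
# objects are never read during cache loading.

def databases_owned_by_role(show_databases_rows, current_role):
    """Filter SHOW DATABASES rows down to databases owned by current_role.

    show_databases_rows: iterable of dicts with "name" and "owner" keys,
    mirroring the columns of Snowflake's SHOW DATABASES output.
    """
    return [
        row["name"]
        for row in show_databases_rows
        # Snowflake identifiers are case-insensitive by default,
        # so compare owner names case-insensitively.
        if row["owner"].upper() == current_role.upper()
    ]
```

With this kind of check, everything inside a non-owned database is skipped wholesale, which sidesteps the cross-object dependency problem described above.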
-
As discussed in messy environment management, we are facing a difficult situation bringing SnowDDL into our current production sites.

To skip the unnecessary databases and focus on a few dozen databases in the beginning phase, we write customized configs and dump a list of the unnecessary databases. The `__custom/*.py` configs make sure we only match the structures of that small number of databases. But in the current design of SnowDDL, even if we restrict the object types to "database,schema,table", it still walks through all the tables, including the ones we try to skip in `__custom/*.py`. All of that is loaded into `engine.schema_cache`.

It would be easy to add some if statements into the for loop to filter the result of `show databases like ...`, but that is invasive and hard to maintain.

There is also a discussion in flexible object filtering, where an "expression" is introduced to include/exclude specific databases. It would help, but still has limitations for more complicated cases, like pattern matching on "database/schema/table" names.
So how about we introduce a filter hook mechanism in the load phase of `schema_cache`, to make it possible to programmatically customize the process of loading objects from Snowflake?
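To make the proposal concrete, here is a minimal sketch of what such a filter hook could look like: a user-supplied predicate invoked for each discovered object during cache loading. All names here (`SchemaCacheSketch`, `load`, the `(object_type, full_name)` shape) are assumptions for illustration, not SnowDDL's API.

```python
from typing import Callable, Optional

# (object_type, fully qualified name) -> True to keep, False to skip
FilterHook = Callable[[str, str], bool]

class SchemaCacheSketch:
    """Hypothetical schema cache with a pluggable load-phase filter."""

    def __init__(self, filter_hook: Optional[FilterHook] = None):
        # Default hook keeps everything, preserving current behaviour
        # when no hook is supplied.
        self.filter_hook = filter_hook or (lambda obj_type, name: True)
        self.objects = []

    def load(self, discovered):
        """discovered: iterable of (object_type, full_name) pairs,
        e.g. as produced by SHOW commands during cache loading."""
        for obj_type, full_name in discovered:
            if self.filter_hook(obj_type, full_name):
                self.objects.append((obj_type, full_name))
```

Because the hook is an arbitrary callable, it can express regex or glob patterns over database/schema/table names, e.g. `SchemaCacheSketch(lambda t, n: not n.startswith("TMP_"))`, which goes beyond a static include/exclude expression.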