Specifying a custom KeyRangesIterator for a query #79

Open
soundofjw opened this issue Oct 16, 2015 · 1 comment

@soundofjw

We have a case where entities have deterministic keys that share prefixes in certain cases.

Basically, we'd like to query by this prefix, which right now is also contained in a ComputedProperty (which drives the query in its current iteration).

To visualize:

  • 25 million entities with key IDs that start with "abc"
  • 2 million entities with key IDs that start with "def"

Using the current query, we see very poor sharding. One shard handles everything in the "def" range.

Knowing that our keys are prefixed this way, we can easily determine the first and last Key for the KeyRange, but I'm not sure of how we would utilize this knowledge to create the sharding we'd like to see.
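
As a rough illustration of deriving the first and last key ID for a prefix, here is a minimal sketch in plain Python. `prefix_range` is a hypothetical helper, not part of the MapReduce library, and it assumes plain ASCII string IDs compared lexicographically:

```python
def prefix_range(prefix):
    """Return (start, end) so that start <= key_id < end matches IDs beginning with prefix.

    Hypothetical helper for illustration only; assumes ASCII string IDs.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    # Bumping the last character gives the smallest string that sorts
    # after every string starting with the prefix.
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, end


print(prefix_range("def"))  # ('def', 'deg')
```

This only yields the outer bounds of the range; it says nothing about how the IDs are distributed inside it.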

Any ideas?

@tkaitchuck
Contributor

There are two ways to shard: splitting lexicographically or using the scatter property. The latter is better and is used automatically whenever possible. If you are doing an MR over all entities of a given type that happen to have strangely distributed IDs, that should just work out of the box, since it will use the scatter property to find the split points. So I'm assuming your problem is actually more complicated, and that your table looks like:
123-a
123-b
...
456-abc-1
...
456-abc-200
...
456-def-1
...
789-a

and you only want to include the ones starting with "456", but the sub-ranges "abc" and "def" under it are very uneven. In this case you need to cajole your use case into working with scatter. A prefix match will not help, because a prefix of 456 gets turned into "id >= 456 and id < 457", which means the only way to split that range is lexicographically. Instead, you can add a property with the value "456" to the entities. Then, if you do an MR over the keyspace with a filter on that property ("newproperty=456"), it will split accurately regardless of how the IDs are distributed.
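
To make the difference between the two splitting strategies concrete, here is a small plain-Python simulation. It is a sketch under stated assumptions, not the library's implementation: the helper names are hypothetical, the ID counts are scaled down to 25,000 and 2,000, the lexicographic splitter simply divides the next character's ASCII range evenly, and a random sample of keys stands in for what the scatter property provides.

```python
import bisect
import random


def lexicographic_splits(prefix, shards, alphabet_size=128):
    # Assumption: divide the range of the character after the prefix evenly.
    step = alphabet_size / shards
    return [prefix + chr(int(step * i)) for i in range(1, shards)]


def sampled_splits(keys, shards, sample_size, rng):
    # Stand-in for scatter: pick split points from a sorted random sample.
    sample = sorted(rng.sample(keys, sample_size))
    return [sample[len(sample) * i // shards] for i in range(1, shards)]


def shard_sizes(keys, splits):
    # Count how many keys fall between consecutive split points.
    sizes = [0] * (len(splits) + 1)
    for key in keys:
        sizes[bisect.bisect_right(splits, key)] += 1
    return sizes


keys = ["456-abc-%06d" % i for i in range(25000)]
keys += ["456-def-%06d" % i for i in range(2000)]

lex = shard_sizes(keys, lexicographic_splits("456", 4))
sam = shard_sizes(keys, sampled_splits(keys, 4, 400, random.Random(0)))

print(lex)  # [0, 27000, 0, 0]
print(sam)  # four roughly equal shards
```

Because every ID continues with "-" after "456", the evenly spaced lexicographic boundaries put all keys in one shard, while split points drawn from a sample follow the actual distribution of IDs.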
