Specifying a custom KeyRangesIterator for a query #79

soundofjw · 2015-10-16T23:59:32Z

We have a case where entities have deterministic keys that, in certain cases, share prefixes.

Basically, we'd like to query by this prefix, which right now is also contained in a ComputedProperty (which drives the query in its current iteration).

To visualize:

25 million entities with key IDs that start with "abc"
2 million entities with key IDs that start with "def"

Using the current query, we see very poor sharding. One shard handles everything in the "def" range.

Knowing that our keys are prefixed this way, we can easily determine the first and last Key for the KeyRange, but I'm not sure how we would use this knowledge to create the sharding we'd like to see.

Any ideas?

tkaitchuck · 2015-10-22T06:42:17Z

There are two ways to do sharding: splitting lexicographically, or using the scatter property. The latter is better, and it is used automatically whenever possible. If you are doing a MR over all entities of a given type that happen to have strangely distributed IDs, that should just work out of the box, because it will use the scatter property to find the split points. So I'm assuming your problem is actually more complicated, and that you mean to say that your table looks like:
123-a
123-b
...
456-abc-1
...
456-abc-200
...
456-def-1
...
789-a

and you only want to include the ones starting with "456", but the sub-ranges "abc" and "def" under it are very uneven. In this case you need to cajole your use case into working with scatter. A prefix match will not help, because a prefix of 456 gets turned into the range "id >= 456 and id < 457", and the only way to split that range is lexicographically. Instead, you can add a property whose value is "456" to those entities. Then, if you do a MR over the keyspace with a filter on that property ("newproperty=456"), it will split accurately regardless of how things are distributed by ID.
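
To make the difference between the two strategies concrete, here is a minimal, self-contained simulation. It is only a sketch and does not use or reproduce the library's code: `prefix_bounds`, `lexicographic_split_points`, `sampled_split_points`, `shard_sizes` and `make_ids` are hypothetical helpers, the lexicographic splitter is a deliberately simple toy model, and random sampling merely stands in for what the scatter property provides. The ID counts are the 25 million / 2 million ratio from the question, scaled down.

```python
import bisect
import random


def prefix_bounds(prefix):
    """Turn a prefix match into a half-open range: start <= id < end."""
    # Bump the last character, e.g. "456" -> "457". Assumes a non-empty prefix
    # whose last character is not the maximum code point.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def lexicographic_split_points(prefix, shards):
    """Toy model: evenly cut the byte range of the character after the prefix.

    The split points depend only on the key space, not on where the data is.
    """
    return [prefix + chr(256 * i // shards) for i in range(1, shards)]


def sampled_split_points(ids, shards, sample_size, rng):
    """Toy stand-in for scatter: take split points from a random sample of IDs."""
    sample = sorted(rng.sample(ids, sample_size))
    return [sample[len(sample) * i // shards] for i in range(1, shards)]


def shard_sizes(ids, split_points):
    """Count how many IDs land in each shard delimited by the split points."""
    sizes = [0] * (len(split_points) + 1)
    for entity_id in ids:
        sizes[bisect.bisect_right(split_points, entity_id)] += 1
    return sizes


def make_ids():
    """Hypothetical data: a large "abc" sub-range and a small "def" sub-range."""
    abc = ["456-abc-%05d" % i for i in range(2500)]
    dfe = ["456-def-%05d" % i for i in range(200)]
    return abc + dfe


if __name__ == "__main__":
    ids = make_ids()
    print("range for prefix:", prefix_bounds("456"))
    print("lexicographic:", shard_sizes(ids, lexicographic_split_points("456", 4)))
    rng = random.Random(0)
    print("sampled:", shard_sizes(ids, sampled_split_points(ids, 4, 200, rng)))
```

In this toy model every ID continues with "-" after the prefix, so all of them sort below the first evenly spaced split point and end up in a single shard. Split points drawn from a sample of the actual IDs follow the data instead, which is why filtering on a property and letting scatter choose the splits gives roughly even shards.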

soundofjw added the question label Oct 16, 2015
