This repository has been archived by the owner on Nov 21, 2024. It is now read-only.

Change 'in' (and '!in') filter to work with a hash for performance. #9

Closed
wants to merge 1 commit

Conversation

dcervelli

The 'in' filter was previously chaining if statements leading to
O(n*m) running time if using the returned filter function while
iterating over a list of n features with m items in the filter.

This change puts the 'in' arguments into a simple javascript hash
for O(n) time.
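The approach the commit describes can be sketched as follows. This is illustrative only; `makeInFilter` is a hypothetical name, and the real module compiles filter functions from expressions like `["in", "class", "road", "path"]` rather than exposing an API like this:

```javascript
// Illustrative sketch of the hash-based 'in' check described above.
// Names are hypothetical; the real module generates filter code from
// filter expressions, not from a (key, values) call like this one.
function makeInFilter(key, values) {
    var lookup = {};
    values.forEach(function (v) { lookup[v] = true; });
    return function (feature) {
        // O(1) average-case lookup instead of an O(m) scan per feature.
        return lookup[feature.properties[key]] === true;
    };
}

var isRoadOrPath = makeInFilter('class', ['road', 'path']);
isRoadOrPath({ properties: { class: 'road' } }); // true
isRoadOrPath({ properties: { class: 'rail' } }); // false
```

One subtlety of using a plain object this way: property keys are coerced to strings, so a numeric value and its string form would collide; a real implementation has to handle typed values (and the `$type` special case) separately.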

var values = {};
Array.prototype.slice.call(arguments, 3).map(function(value) {
if (key === '$type') {
values[VectorTileFeatureTypes.indexOf(value)] = true;
Contributor

This should be a .forEach if it doesn't return anything, or a for loop; or a reduce if you want it to be functional.

@jfirebaugh
Contributor

Thanks for the contribution!

This is a change that needs to be benchmarked for both simple and complex filters. Simple filters are extremely common and we need to make sure not to deoptimize the performance of a filter like ["in", "key", "a", "b"].

To prevent perf regressions for small filters we use the original code
path below a hard-coded threshold (currently set at 30 based on
rudimentary perf testing).
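The hybrid strategy described above might look roughly like this (a sketch under the stated 30-item cutoff; `compileInFilter` and `THRESHOLD` are hypothetical names, not this module's actual code):

```javascript
// Sketch of the hybrid strategy: linear scan for small value sets,
// hash lookup above the threshold. 30 comes from the commit message;
// all names here are illustrative.
var THRESHOLD = 30;

function compileInFilter(key, values) {
    if (values.length < THRESHOLD) {
        // Small sets: a plain array scan is cheap and avoids any
        // hash-lookup overhead.
        return function (feature) {
            return values.indexOf(feature.properties[key]) !== -1;
        };
    }
    // Large sets: build the lookup object once, when the filter is compiled.
    var lookup = {};
    values.forEach(function (v) { lookup[v] = true; });
    return function (feature) {
        return lookup[feature.properties[key]] === true;
    };
}
```

The key point is that the branch is taken once at compile time, so the returned filter function itself stays monomorphic on the hot path.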
@dcervelli
Author

I changed the map to forEach.

Agreed on avoiding perf regressions. I added a hard-coded constant (currently 30, see below) below which we do the old unrolled if thing. Otherwise we use the map. I arrived at 30 using the following rudimentary test:

var ff = require('./index.js');

var n = 1000000;
var classes = [];
var features = [];
for (var i = 0; i < n; i++) {
    var c = "c" + i;
    classes.push(c);
    var feature = {
        type: 1,
        properties: {
            class: c
        }
    };
    features.push(feature);
}

var runs = 50;
var testFilterFunction = function(text, filter) {
    console.time(text);
    for (var r = 0; r < runs; r++) {
        var testFilter = ff(filter);
        var matchCount = 0;
        for (var i = 0; i < n; i++) {
            if (testFilter(features[i])) {
                matchCount++;
            }
        }
    }
    console.timeEnd(text);
    console.log(matchCount);
};

var test = function(subN) {
    var filter = ["in", "class"].concat(classes.slice(0, subN));
    testFilterFunction(subN + "/" + n, filter);
};

test(1);
test(29); // Unrolled
test(30); // Map lookup
test(100);
test(250);

with the following results (29 is unrolled, 30 is map lookup):

// this branch
1/1000000: 2042.210ms
1
29/1000000: 12230.287ms
29
30/1000000: 13818.584ms
30
100/1000000: 15005.090ms
100
250/1000000: 15625.071ms
250

// Switched to branch 'master'
1/1000000: 1799.886ms
1
29/1000000: 12280.264ms
29
30/1000000: 12970.570ms
30
100/1000000: 40195.818ms
100
250/1000000: 100073.038ms
250

I did further perf testing around the 1-5 range and found neither implementation to be consistently faster than the other.

@mourner
Member

mourner commented Jan 28, 2016

@dcervelli @jfirebaugh Suddenly got an idea — what do you think about keeping the array form, BUT sorting it when generating code, and doing a simple binary search instead of indexOf in the checking code? This should be much simpler in implementation while being more performant and less memory-hungry in most cases while not degrading (not sure but gut feeling).
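The sorted-array idea can be sketched like this. Names are hypothetical, and the comparison assumes string values so that the default `sort()` and `<` agree; the eventual implementation in #12 may differ:

```javascript
// Sketch of the sorted-array + binary-search alternative.
// Assumes string values (default sort() is lexicographic); names illustrative.
function makeSortedInFilter(key, values) {
    var sorted = values.slice().sort(); // sort once, when the filter is compiled
    return function (feature) {
        var v = feature.properties[key];
        var lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            var mid = (lo + hi) >> 1;
            if (sorted[mid] === v) return true;
            if (sorted[mid] < v) lo = mid + 1;
            else hi = mid - 1;
        }
        return false; // O(log m) comparisons per feature
    };
}
```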

@dcervelli
Author

@dcervelli @jfirebaugh Suddenly got an idea — what do you think about keeping the array form, BUT sorting it when generating code, and doing a simple binary search instead of indexOf in the checking code? This should be much simpler in implementation while being more performant and less memory-hungry in most cases while not degrading (not sure but gut feeling).

That's definitely an improvement over the original code (although we'd need to worry about perf regression in the filter construction case, not sure if that's important). I don't think there's any net memory savings between the two methods because all of the values are in memory somewhere (code is memory too!) -- my hunch would be that the map uses less memory than the unrolled loop. Of course, the map gets you O(n) instead of O(n lg n) with the binary search. On the whole I still prefer the map option.

The biggest argument against taking this pull request is that it increases the complexity significantly, which I acknowledge. That said, this module does one thing at a low level and the complexity doesn't leak out, so I don't worry about it.

Also, I'll file an issue on this too, but there's another problem that we will likely need to fix which is that this function errors out with more than 64k elements in the set (stack overflow because they are all passed as args).
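The failure mode mentioned here can be observed directly. The exact argument limit is engine-dependent, so this sketch tolerates both outcomes rather than asserting one:

```javascript
// Engines cap how many arguments a single call (or .apply) can take;
// past the cap V8 throws a RangeError. The exact limit varies by engine
// and available stack, so both outcomes below are plausible.
var big = new Array(200000);
for (var i = 0; i < big.length; i++) big[i] = i;

var outcome;
try {
    outcome = Math.max.apply(null, big); // each element becomes a call argument
} catch (e) {
    outcome = e.name; // typically 'RangeError' when the limit is exceeded
}
console.log(outcome); // 199999, or 'RangeError' on engines with a lower limit
```

Passing the value set around as a single array argument, rather than spreading it into individual call arguments, sidesteps the limit entirely.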

@mourner
Member

mourner commented Jan 28, 2016

Of course, the map gets you O(n) instead of O(n lg n) with the binary search.

O(log n) can be faster than O(1) because of constant factors involved. E.g. dictionary mode in V8 is known to be slow. So I'd like to make a proof of concept implementation first and then benchmark to be sure.

I don't think there's any net memory savings between the two methods because all of the values are in memory somewhere

The same article says "Hash tables in V8 are large arrays containing keys and values." An array would need just the keys and hold half as many elements, so an array would probably take slightly less memory.

@dcervelli
Author

O(log n) can be faster than O(1) because of constant factors involved. E.g. dictionary mode in V8 is known to be slow. So I'd like to make a proof of concept implementation first and then benchmark to be sure.

Makes sense.

The same article says "Hash tables in V8 are large arrays containing keys and values.". An array would need just the keys and have 2 times less elements. So an array would take slightly less memory probably.

Oops, of course this makes sense. For some reason I was assuming that you were suggesting it would work similarly to how it worked before with unrolled if statements. An unrolled binary search generator sounds awful to write but I suppose it could be done :-)

@mourner
Member

mourner commented Jan 29, 2016

@dcervelli yeah, unrolled is really bad on many items in all senses. I'll make a binary search PR and we'll compare.

@mourner
Member

mourner commented Jan 29, 2016

@dcervelli just a quick note — compilation time is not nearly as important as filter time, because compilation happens only once for each filter on a style change, and filter runs hundreds of thousands of times (for each map feature). So compilation shouldn't be a part of the main benchmark.

@mourner
Member

mourner commented Jan 29, 2016

Closing in favor of #12.

@mourner mourner closed this Jan 29, 2016