
[5.7] Optimize Collection::mapWithKeys #24656

Closed · wants to merge 5 commits into from

Conversation

dmitrybubyakin

Benchmark (100,000 iterations):

Old method: 4.133s
New method: 3.175s
---
Old method: 3.754s
New method: 2.975s
---
Old method: 3.773s
New method: 3.246s

https://gist.github.com/dmitrybubyakin/6bdcc7c0ee7eaffdae699d61017214ee

15-25% faster.
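
The PR diff itself isn't shown in this thread. A rough sketch of the two approaches being benchmarked, written as standalone functions (the names and signatures are illustrative, not the actual Collection method):

<?php

// Current approach: copy each key/value pair the callback returns into the result.
function mapWithKeysOld(array $items, callable $callback): array
{
    $result = [];

    foreach ($items as $key => $value) {
        foreach ($callback($value, $key) as $mapKey => $mapValue) {
            $result[$mapKey] = $mapValue;
        }
    }

    return $result;
}

// Proposed direction: map all items first, then merge the returned arrays at once.
function mapWithKeysNew(array $items, callable $callback): array
{
    $mapped = array_map($callback, $items, array_keys($items));

    return $mapped ? array_replace(...$mapped) : [];
}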

derekmd (Contributor) commented Jun 21, 2018

This would change the behavior from a merge to an append. Right now it's possible for a later mapWithKeys() iteration to overwrite the key of a previous iteration. This would make the first keyed value always be used.
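
A minimal sketch of the overwrite behavior described above, assuming Illuminate\Support\Collection is autoloaded (the data is made up):

<?php

use Illuminate\Support\Collection;

$result = (new Collection([
    ['id' => 1, 'name' => 'first'],
    ['id' => 1, 'name' => 'second'], // maps to the same key as the previous item
]))->mapWithKeys(function ($item) {
    return [$item['id'] => $item['name']];
});

// Current behavior: the later pair wins, so this prints [1 => 'second'].
// An append-style rewrite (e.g. the + operator) would keep 'first' instead.
var_dump($result->all());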

tillkruss (Contributor) left a comment

Please add tests for overwriting keys.

vlakoff (Contributor) commented Jun 22, 2018

Beware of numeric keys, and order of overwriting.

Working solutions might be:

$result = $callback($value, $key) + $result;
$result = array_replace($result, $callback($value, $key));

Note that array_replace() seems to be the same as + with the overwriting order reversed, but it is probably significantly slower (function call overhead).

So, in principle, why not change to the + operator, provided it's properly benchmarked and tested.
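
A plain-PHP illustration of the differences in overwrite order and numeric-key handling (the array contents are arbitrary):

<?php

$result = [0 => 'old', 1 => 'kept'];
$new    = [0 => 'new', 2 => 'added'];

// + keeps the left operand's value on key collision.
var_dump($new + $result);               // [0 => 'new', 2 => 'added', 1 => 'kept']
var_dump($result + $new);               // [0 => 'old', 1 => 'kept', 2 => 'added']

// array_replace() keeps the last argument's value, like the current per-pair assignment.
var_dump(array_replace($result, $new)); // [0 => 'new', 1 => 'kept', 2 => 'added']

// array_merge() would renumber numeric keys, so it isn't usable here.
var_dump(array_merge($result, $new));   // [0 => 'old', 1 => 'kept', 2 => 'new', 3 => 'added']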

vlakoff (Contributor) commented Jun 22, 2018

Refs #16564 and #16552 (comment)

The proposals here seem to support "callback returning several rows" as well.

vlakoff (Contributor) commented Jun 22, 2018

More importantly: #16552 (comment)

Also take care of preserving the array order (not the same thing as the numerical indexes). To verify this, a foreach on the result should iterate the items in the same order.

That would be quite a subtle BC break... I haven't checked the proposal here against this. And tests should be added for this, if there are none yet.
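
A small illustration of the ordering concern, assuming an accumulator of the form $result = $callback(...) + $result (hypothetical data):

<?php

$data = ['x' => 1, 'y' => 2, 'z' => 3];

// Accumulating with + prepends each new pair, reversing the input order.
$result = [];
foreach ($data as $key => $value) {
    $result = [$key => $value] + $result;
}
var_dump(array_keys($result)); // ['z', 'y', 'x']

// The current per-pair assignment keeps the input order.
$result = [];
foreach ($data as $key => $value) {
    $result[$key] = $value;
}
var_dump(array_keys($result)); // ['x', 'y', 'z']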

vlakoff (Contributor) commented Jun 22, 2018

The added test seems to be worth it, but I guess your new code is now slower than the current code, isn't it?

dmitrybubyakin (Author)

@vlakoff @tillkruss Added tests for overwriting keys.

Benchmark (100,000 iterations):

Old method: 3.782s
New method: 2.757s
---
Old method: 4.678s
New method: 3.075s
---
Old method: 4.173s
New method: 3.014s

28-35% faster now.

dmitrybubyakin (Author)

@vlakoff I made some changes, so my new code is faster than the previous code, but it uses more memory.

vlakoff (Contributor) commented Jun 22, 2018

I see, about memory usage: this should be measured as well. With your code here, we end up with one huge array_replace() call with many arguments!

Have you tried my simple $result = $callback($value, $key) + $result;?

sisve (Contributor) commented Jun 22, 2018

While the current implementation says "The callback should return an associative array with a single key/value pair", it supports a callback returning multiple entries. Is this something that is still supported, or are we dropping that support?
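
A short example of the multi-entry usage in question, assuming Illuminate\Support\Collection is autoloaded (the field names are made up):

<?php

use Illuminate\Support\Collection;

$result = (new Collection([
    ['id' => 1, 'first' => 'John', 'last' => 'Doe'],
]))->mapWithKeys(function ($item) {
    // Two key/value pairs from a single item.
    return [
        'first_'.$item['id'] => $item['first'],
        'last_'.$item['id']  => $item['last'],
    ];
});

// ['first_1' => 'John', 'last_1' => 'Doe']
var_dump($result->all());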

dmitrybubyakin (Author) commented Jun 22, 2018

@vlakoff Yes, I have. $result = $callback($value, $key) + $result; is much slower than the old code:

Old method: 3.786s
New method: 27.701s

@sisve We are not dropping anything.

dmitrybubyakin (Author) commented Jun 22, 2018

@vlakoff

https://gist.github.com/dmitrybubyakin/6bdcc7c0ee7eaffdae699d61017214ee

100000 iterations over 100 items, strlen 10:
Old method: time 4.066s, memory: 2.00Mb
New method: time 2.844s, memory: 2.00Mb
----
100000 iterations over 100 items, strlen 100:
Old method: time 4.356s, memory: 2.00Mb
New method: time 3.294s, memory: 2.00Mb
----
100000 iterations over 100 items, strlen 1000:
Old method: time 4.206s, memory: 2.00Mb
New method: time 3.136s, memory: 2.00Mb
----

100 iterations over 1000 items, strlen 1000:
Old method: time 45ms, memory: 4.00Mb
New method: time 34ms, memory: 4.00Mb
----
100 iterations over 10000 items, strlen 1000:
Old method: time 651ms, memory: 22.00Mb
New method: time 647ms, memory: 26.00Mb
----
100 iterations over 100000 items, strlen 1000:
Old method: time 10.186s, memory: 181.50Mb
New method: time 9.437s, memory: 221.51Mb
----

vlakoff (Contributor) commented Jun 22, 2018

I tried to reproduce the performance increase, but on my machine your code is actually a bit slower...

<?php

$nb = 100;

$data = range(1, 1000);

$callback = function ($value, $key) {
    return [$key => $value];
};

// Old method: copy each returned pair into the result inside a foreach.
$t1 = microtime(true);
for ($ii = $nb; $ii--; ) {

    $result = [];
    foreach ($data as $key => $value) {
        $assoc = $callback($value, $key);
        foreach ($assoc as $mapKey => $mapValue) {
            $result[$mapKey] = $mapValue;
        }
    }

}

// New method: map all items first, then merge with array_replace().
$t2 = microtime(true);
for ($ii = $nb; $ii--; ) {

    $values = array_map($callback, $data, array_keys($data));
    $result = array_replace(...$values);

}
$t3 = microtime(true);

// Elapsed time of each method, then the new method as a percentage of the old.
echo $t2 - $t1;
echo "\n";
echo $t3 - $t2;

echo "\n\n";
echo $t2 - $t1 == 0 ? 'N/A' : ($t3 - $t2) * 100 / ($t2 - $t1);

Both methods seem to be O(n).

dmitrybubyakin (Author)

@vlakoff On my machine (PHP 7.2.6):

0.01982307434082
0.012885093688965

65.000481093043%

vlakoff (Contributor) commented Jun 22, 2018

Yeah, I had tested on an old PHP 5 box. On PHP 7, both versions are much faster, and yours does indeed seem to be faster.

vlakoff (Contributor) commented Jun 22, 2018

Again on my box, above a certain number of items, your code suddenly gets slower. I guess there is some memory exhaustion, which breaks scalability.

vlakoff (Contributor) commented Jun 22, 2018

I can confirm on 3v4l.org that on large data sets (try 10000, 50000, ...) your code runs slower and uses more memory. There seems to be a threshold at which execution speed drops sharply, and the memory usage difference gradually increases with data size.

For me, scalability is by far the most important factor, even if it means slower (yet acceptable) execution for smaller datasets. Memory usage shouldn't be neglected either.

vlakoff (Contributor) commented Jun 22, 2018

Just as a reminder, some possible subtle BC breaks have been highlighted in this thread, and unit tests could be added for these.

vlakoff (Contributor) commented Jun 24, 2018

@dmitrybubyakin I suggest you open a new PR that just adds the unit test covering numeric-key overwriting.

The other cases I mentioned (callback returning multiple rows, preservation of array order) are actually already covered by tests.

dmitrybubyakin (Author)

@vlakoff Ok, I'll submit a new PR.
