
[FEA] Replace numpy mod op in Series.hash_encode #935

Closed · cmgreen210 opened this issue Feb 13, 2019 · 11 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf C++/CUDA code), Python (Affects Python cuDF API)

Comments

@cmgreen210 (Contributor)
The current implementation of hash_encode is bottlenecked at the numpy call https://github.com/rapidsai/cudf/blob/branch-0.6/python/cudf/dataframe/series.py#L1078. This computation should be moved to the GPU.

@cmgreen210 (Contributor, Author)
I will work on a fix for this.

@cmgreen210 (Contributor, Author)
This will be easier than expected. We'll change the line to `hashed_values = np.mod(hashed_values.data.to_array(), stop)`:

```
In [44]: %time np.mod(hv, 100)
CPU times: user 24.8 s, sys: 404 ms, total: 25.2 s
Wall time: 25.2 s
Out[44]: array([ 9, 69, 85, ..., 87,  1, 22], dtype=int32)

In [45]: %time np.mod(hv.data.to_array(), 100)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.58 ms
Out[45]: array([ 9, 69, 85, ..., 87,  1, 22], dtype=int32)
```

@kkraus14 (Collaborator)
> This will be easier than expected. We'll change the line to `hashed_values = np.mod(hashed_values.data.to_array(), stop)` […]

That will still run it on the CPU and trigger a device-to-host copy on the `.to_array()` call.
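To make that point concrete, here is a small hedged sketch (`hv_host` is a hypothetical stand-in for the result of `hashed_values.data.to_array()`): the fast timing above comes from a vectorized host-side `np.mod` running *after* the data has already been copied off the GPU, not from GPU execution.

```python
import numpy as np

# Hypothetical stand-in for hashed_values.data.to_array(): by this point the
# data has already been copied device-to-host, so np.mod below is a fast
# vectorized CPU operation, not GPU work.
hv_host = np.array([109, 169, 185, 187, 101, 122], dtype=np.int32)

# Fast because it is a single vectorized host op on an int32 array,
# but it still pays one device-to-host transfer per call.
result = np.mod(hv_host, 100)
print(result)  # [ 9 69 85 87  1 22]
```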

@kkraus14 kkraus14 added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Feb 13, 2019
@kkraus14 (Collaborator)
I don't recall if we have a modulo binary op in libcudf that you'd probably want here, @jrhemstad?

@cmgreen210 (Contributor, Author)
> That will still run it on the CPU and trigger a device-to-host copy on the `.to_array()` call.

Good point. But the speedup is probably good enough for now. wdyt?

@kkraus14 (Collaborator)
> Good point. But the speedup is probably good enough for now. wdyt?

Implementing a binary modulo op on the GPU should be trivial. If the arrays are larger, or if we call this on a bunch of columns, the allocations and memory copies would become expensive.

@jrhemstad (Contributor)
> I don't recall if we have a modulo binary op in libcudf that you'd probably want here, @jrhemstad?

So you want a binary op doing a modulo of a column by a scalar? This will be added in #892.

@cmgreen210 (Contributor, Author) commented Feb 13, 2019
@jrhemstad yes. Will this be added by the 0.6 release? If not, we might want to do an intermediate fix.

@kkraus14 (Collaborator)
> @jrhemstad yes. Will this be added by the 0.6 release? If not, we might want to do an intermediate fix.

@cmgreen210 as a stopgap it should be pretty straightforward to write a numba kernel that does the modulo operation until we move it into libcudf.
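As a rough sketch of what such a stopgap could look like (helper names are hypothetical, and the GPU path assumes numba with a CUDA-capable device; the libcudf binary op from #892 would ultimately replace this):

```python
import numpy as np

try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:
    HAVE_CUDA = False

if HAVE_CUDA:
    @cuda.jit
    def mod_kernel(arr, divisor, out):
        # One thread per element: out[i] = arr[i] % divisor, entirely on device.
        i = cuda.grid(1)
        if i < arr.size:
            out[i] = arr[i] % divisor

    def gpu_mod(device_arr, divisor):
        # Hypothetical helper: keeps the data on the GPU, avoiding the
        # device-to-host copy that .to_array() triggers.
        out = cuda.device_array_like(device_arr)
        threads = 128
        blocks = (device_arr.size + threads - 1) // threads
        mod_kernel[blocks, threads](device_arr, divisor, out)
        return out

# Host-side reference implementation of the same computation, for comparison.
def host_mod(arr, divisor):
    return np.mod(arr, divisor)
```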

@cmgreen210 (Contributor, Author)
> @cmgreen210 as a stopgap it should be pretty straightforward to write a numba kernel that does the modulo operation until we move it into libcudf.

@kkraus14 Sounds good. I'll have that up ASAP.

@devavret (Contributor)
The modulo implementation is now here. It's not merged, but it's tested to work.
