
Gluon's PReLU is very slow and a fix to it #10972

Closed
rocketbear opened this issue May 16, 2018 · 3 comments · Fixed by #11012

Comments

@rocketbear

I have experienced significantly slower training when using the PReLU activation instead of ReLU in a model composed with the Gluon API. Measured in samples processed per second on GPU, PReLU runs at roughly 1/5 the speed of ReLU.
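
For reference, the comparison I ran was roughly along these lines (a minimal sketch rather than the exact benchmark; the MLP, batch size, and the mx.gpu(0) context are placeholder assumptions, otherwise standard MXNet 1.x Gluon APIs):

import time
import mxnet as mx
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn

ctx = mx.gpu(0)  # assumes a GPU is available; use mx.cpu() otherwise

def make_net(act):
    # Small hypothetical MLP; only the activation block differs between runs.
    net = nn.HybridSequential()
    with net.name_scope():
        net.add(nn.Dense(1024), act(), nn.Dense(10))
    net.initialize(ctx=ctx)
    net.hybridize()
    return net

def samples_per_sec(net, batch_size=256, iters=100):
    x = nd.random.uniform(shape=(batch_size, 512), ctx=ctx)
    y = nd.random.uniform(shape=(batch_size, 10), ctx=ctx)
    loss_fn = gluon.loss.L2Loss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    for i in range(iters + 1):
        if i == 1:
            nd.waitall()            # first iteration is warm-up only
            start = time.time()
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(batch_size)
    nd.waitall()                    # wait for asynchronous execution to finish
    return batch_size * iters / (time.time() - start)

print('ReLU  samples/sec:', samples_per_sec(make_net(lambda: nn.Activation('relu'))))
print('PReLU samples/sec:', samples_per_sec(make_net(nn.PReLU)))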

I finally managed to bring Gluon's PReLU back to normal speed with the following modification.
This is the original __init__ of PReLU:

def __init__(self, alpha_initializer=initializer.Constant(0.25), **kwargs):
    super(PReLU, self).__init__(**kwargs)
    with self.name_scope():
        self.alpha = self.params.get('alpha', shape=(1,), init=alpha_initializer)

And this is the modified __init__:

def __init__(self, in_channels=1, alpha_initializer=initializer.Constant(0.25), **kwargs):
    super(PReLU, self).__init__(**kwargs)
    with self.name_scope():
        self.alpha = self.params.get('alpha', shape=(in_channels,), init=alpha_initializer)

The key is to pass the expected number of channels to the PReLU block, so that it does not share the negative slope across channels. The downside of this solution is that you have to pass in the number of channels every time.
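
For illustration, here is a self-contained sketch of the per-channel variant; ChannelPReLU is just a hypothetical local name for a block carrying the modified __init__ above, and its forward pass reuses the same LeakyReLU 'prelu' mode as the stock Gluon block:

import mxnet as mx
from mxnet import initializer, nd
from mxnet.gluon import nn

class ChannelPReLU(nn.HybridBlock):
    # Hypothetical local block with the modified __init__ from above:
    # one learnable negative slope per channel instead of a single shared one.
    def __init__(self, in_channels, alpha_initializer=initializer.Constant(0.25), **kwargs):
        super(ChannelPReLU, self).__init__(**kwargs)
        with self.name_scope():
            self.alpha = self.params.get('alpha', shape=(in_channels,),
                                         init=alpha_initializer)

    def hybrid_forward(self, F, x, alpha):
        # Same forward as gluon.nn.PReLU; with gamma of shape (in_channels,)
        # the 'prelu' mode applies one slope per channel.
        return F.LeakyReLU(x, gamma=alpha, act_type='prelu')

net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Conv2D(channels=64, kernel_size=3, padding=1))
    net.add(ChannelPReLU(in_channels=64))   # must match the conv's channel count
net.initialize(ctx=mx.cpu())
net.hybridize()

print(net(nd.random.uniform(shape=(1, 3, 32, 32))).shape)   # (1, 64, 32, 32)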

I don't know why the two settings (shared vs. per-channel alpha) have such drastically different performance. The MXNet contributors should investigate this issue in more depth.

@szha
Member

szha commented May 16, 2018

These two implementations are different in terms of the number of parameters. The performance hit likely comes from the broadcast operation.

@chinakook
Contributor

Yes, I investigated the leakyrelu-inl.h source code. There is indeed a broadcast operation when the shape of the 'alpha' parameter is (1,).

@chinakook
Contributor

I think there is no need to broadcast when multiplying a scalar and a matrix. Maybe some other operation is more suitable for this kind of multiplication.
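
To illustrate the idea, here is a rough NDArray-level sketch (not the actual leakyrelu-inl.h kernel; the tensor sizes and the mx.gpu(0) context are arbitrary assumptions) comparing a shared slope applied through an explicit broadcast with the same slope folded in as a plain scalar:

import time
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)  # assumes a GPU; use mx.cpu() otherwise
x = nd.random.uniform(-1, 1, shape=(64, 256, 32, 32), ctx=ctx)
alpha = nd.full((1, 1, 1, 1), 0.25, ctx=ctx)    # shared slope stored as a tensor

def bench(fn, repeat=100):
    fn(); nd.waitall()                          # warm-up and sync
    start = time.time()
    for _ in range(repeat):
        fn()
    nd.waitall()
    return (time.time() - start) / repeat * 1e3  # ms per call

# PReLU written as max(0, x) + alpha * min(0, x), with the slope applied
# through an explicit broadcast (what a tensor-valued shared alpha requires).
t_bcast = bench(lambda: nd.relu(x) + nd.broadcast_mul(alpha, nd.minimum(x, 0)))

# The same math with the shared slope treated as a plain Python scalar.
t_scalar = bench(lambda: nd.relu(x) + 0.25 * nd.minimum(x, 0))

print('broadcast: %.3f ms  scalar: %.3f ms' % (t_bcast, t_scalar))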
