-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple axes #31
Comments
Hi, I am also interested by a blocked FFHT. For instance, I am using a dumb FFHT with a rotation kernel based on the cblas?rotm_ (code below) witch allow me to compute 1-D Hadamard transform of a block vector W (W is m x bs). But in fact cblas?rotm_ is considering data as vector of length 2, not of length bs with leads to poor performance. void _preAlps_ROTATEData_MAT_d(double* data, int nIn, int ncols, unsigned c1, unsigned c2,
Timing_FWHT_t* TimerFWHT){
TimerFWHT->Level3TrashTime = MPI_Wtime();
// param := flag, h11, h21, h12, h22
static const double rotatedata_mat_param[] = {-1.0, 1.0, 1.0, 1.0, -1.0};
// Alias
double* x = &data[c1];
double* y = &data[c2];
TimerFWHT->PrepareROTATEDataTime += MPI_Wtime() - TimerFWHT->Level3TrashTime;
TimerFWHT->Level3TrashTime = MPI_Wtime();
// Performs modified Givens rotation of points in the plane
cblas_drotm(ncols, x, nIn, y, nIn, rotatedata_mat_param);
TimerFWHT->ROTATEDataTime += MPI_Wtime() - TimerFWHT->Level3TrashTime;
}
int preAlps_FWHT_MAT_d( double* data, unsigned n, int ncols,
Timing_FWHT_t* TimerFWHT) {
TimerFWHT->FWHTSample += 1;
TimerFWHT->Level1TrashTime = MPI_Wtime();
int ierr = 0;
// Require localsize to be a power of 2
unsigned localsize = n;
unsigned l2 = preAlps_ILOG2(localsize) - 1;
unsigned l3 = 1 << l2;
if (localsize != l3) {
CPLM_Abort("Data row length should be a power of 2,"
" localsize: %u ; l3: %u", localsize, l3);
}
// Compute the FWHT
unsigned tmp;
for (unsigned i = 0; i<l2; ++i) {
l3 = (unsigned)(1 << i);
for (unsigned j = 0; j < localsize; j += (1 << (i+1))) {
for (unsigned k = 0; k < l3; ++k) {
tmp = j + k;
TimerFWHT->Level2TrashTime = MPI_Wtime();
_preAlps_ROTATEData_MAT_d(data, localsize, ncols, tmp, tmp+l3, TimerFWHT);
TimerFWHT->FUNROTATEDataTime += MPI_Wtime() - TimerFWHT->Level2TrashTime;
}
}
}
TimerFWHT->FWHTTotalTime += MPI_Wtime() - TimerFWHT->Level1TrashTime;
return ierr;
} I did a benchmark using np=8 mpi process (computer is intel) to compute a 1-D Blocked Fast Walsh Hadamard transform of a Matrix W of size m x n. W is Column Major and distributed by rows on each mpi process. The global size is denoted by m=np*mp and the local size is denoted by mp=6.25E+5. I used m=5E+6 and n=256 and applied a zero-padding on mp. The block size is denoted by bs. I am computing the 1-D block Fast Walsh Hadamard transform (almost like) as the following: int remainder = n%bs;
int coef = (remainder==0) ? n/bs : ((n - remainder) / bs) + 1;
int curbs = 0; /**< current block size */
for (int j = 0; j < coef; ++j){
curbs = (remainder==0) ? bs : ((j<coef-1) ? bs : remainder);
preAlps_FWHT_MAT_d( *(W+j*mp*bs), mp, curbs);
} who gave me the following plot: So my point is could you update the FFHT lib in order to allow yours users to used blocked FFHT/FWHT? Or help me to optimize my code? From your home page, I am targetting bs*0.50 sec for m=2^27 as computation time. (TY) |
I want to reduce Hadamard Transform every column in my matrix.
Is there currently a way to do this with FFHT without using a python outer loop?
The text was updated successfully, but these errors were encountered: