Support multiple axes #31

thomasahle · 2021-12-05T08:37:18Z

I want to reduce Hadamard Transform every column in my matrix.
Is there currently a way to do this with FFHT without using a python outer loop?

W-Wuxian · 2022-01-21T11:29:39Z

Hi, I am also interested by a blocked FFHT. For instance, I am using a dumb FFHT with a rotation kernel based on the cblas?rotm_ (code below) witch allow me to compute 1-D Hadamard transform of a block vector W (W is m x bs). But in fact cblas?rotm_ is considering data as vector of length 2, not of length bs with leads to poor performance.

void _preAlps_ROTATEData_MAT_d(double* data, int nIn, int ncols, unsigned c1, unsigned c2,
 Timing_FWHT_t* TimerFWHT){
  TimerFWHT->Level3TrashTime = MPI_Wtime();
  // param := flag, h11, h21, h12, h22
  static const double rotatedata_mat_param[] = {-1.0, 1.0, 1.0, 1.0, -1.0};
  // Alias
  double* x = &data[c1];
  double* y = &data[c2];
  TimerFWHT->PrepareROTATEDataTime += MPI_Wtime() - TimerFWHT->Level3TrashTime;
  TimerFWHT->Level3TrashTime = MPI_Wtime();
  // Performs modified Givens rotation of points in the plane
  cblas_drotm(ncols, x, nIn, y, nIn, rotatedata_mat_param);
  TimerFWHT->ROTATEDataTime += MPI_Wtime() - TimerFWHT->Level3TrashTime;
}
int preAlps_FWHT_MAT_d( double* data, unsigned n, int ncols,
 Timing_FWHT_t* TimerFWHT) {
  TimerFWHT->FWHTSample += 1;
  TimerFWHT->Level1TrashTime = MPI_Wtime();
  int ierr = 0;
  // Require localsize to be a power of 2
  unsigned localsize = n;
  unsigned l2 = preAlps_ILOG2(localsize) - 1;
  unsigned l3 = 1 << l2;
  if (localsize != l3) {
    CPLM_Abort("Data row length should be a power of 2,"
	       " localsize: %u ; l3: %u", localsize, l3);
  }
  // Compute the FWHT
  unsigned tmp;
  for (unsigned i = 0; i<l2; ++i) {
    l3 = (unsigned)(1 << i);
    for (unsigned j = 0; j < localsize; j += (1 << (i+1))) {
      for (unsigned k = 0; k < l3; ++k) {
        tmp = j + k;
        TimerFWHT->Level2TrashTime = MPI_Wtime();
        _preAlps_ROTATEData_MAT_d(data, localsize, ncols, tmp, tmp+l3, TimerFWHT);
        TimerFWHT->FUNROTATEDataTime += MPI_Wtime() - TimerFWHT->Level2TrashTime;
      }
    }
  }
  TimerFWHT->FWHTTotalTime += MPI_Wtime() - TimerFWHT->Level1TrashTime;
  return ierr;
}

I did a benchmark using np=8 mpi process (computer is intel) to compute a 1-D Blocked Fast Walsh Hadamard transform of a Matrix W of size m x n. W is Column Major and distributed by rows on each mpi process. The global size is denoted by m=np*mp and the local size is denoted by mp=6.25E+5. I used m=5E+6 and n=256 and applied a zero-padding on mp. The block size is denoted by bs. I am computing the 1-D block Fast Walsh Hadamard transform (almost like) as the following:

int remainder = n%bs;
int coef = (remainder==0) ? n/bs : ((n - remainder) / bs) + 1;
int curbs = 0; /**< current block size */
for (int j = 0; j < coef; ++j){
  curbs = (remainder==0) ? bs : ((j<coef-1) ? bs : remainder);
  preAlps_FWHT_MAT_d( *(W+j*mp*bs), mp, curbs); 
}

who gave me the following plot:
TpsBSRHT_d.pdf
We should only focus on blue bars (FWHT) and the bs facet parameters. From the plot TpsBSRHT_d.pdf, we can see that the computation time decreases for intermediate block size (bs), probably due to the cache effect. But from my point of view the computation time is still terrible!?

So my point is could you update the FFHT lib in order to allow yours users to used blocked FFHT/FWHT? Or help me to optimize my code? From your home page, I am targetting bs*0.50 sec for m=2^27 as computation time. (TY)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multiple axes #31

Support multiple axes #31

thomasahle commented Dec 5, 2021

W-Wuxian commented Jan 21, 2022 •

edited

Loading

Support multiple axes #31

Support multiple axes #31

Comments

thomasahle commented Dec 5, 2021

W-Wuxian commented Jan 21, 2022 • edited Loading

W-Wuxian commented Jan 21, 2022 •

edited

Loading