TFactor: bulkerify GEMV (both MC and GPU) #1214

Draft: wants to merge 39 commits into base: master

Commits (39)
b1197be
factor out gemvLoop for MC
albestro Nov 14, 2024
9647ec8
mc-local: add stepGEMVAll with bulk using workspaces
albestro Nov 14, 2024
444ca31
mc-dist: add stepGEMVAll with bulk using workspaces
albestro Nov 14, 2024
4d57625
bulk for gpu (adapt mc to use one less workspace)
albestro Nov 15, 2024
4e745e2
add missing headers
albestro Nov 18, 2024
6af3b87
use gpulapack wrappers for lacpy
albestro Nov 18, 2024
026061e
introduce parameter for tfactor nworkers
albestro Nov 27, 2024
aee941a
nest workspace creation inside the actual algorithm
albestro Nov 27, 2024
9369805
nest even more, closer to the point of usage
albestro Nov 27, 2024
9828364
basic change to the logic for number of workers
albestro Nov 27, 2024
a595e42
fix documentation as started in #798
albestro Nov 27, 2024
b3d6223
fix typo
albestro Nov 27, 2024
37bc669
snake case renaming
albestro Dec 5, 2024
27de9ec
fix: get_tfactor_nworkers was out of place
albestro Dec 5, 2024
27978c9
wip: change api accepting an external workspace
albestro Dec 5, 2024
6765ccb
adapt algorithms to new api with external workspace
albestro Dec 5, 2024
ff577cd
adapt also the test to the new api
albestro Dec 5, 2024
f5dadcf
basic check and doc for workspace
albestro Dec 5, 2024
c5f4d75
fix inspect
albestro Dec 5, 2024
a967d6c
use lowercase in snake_case
albestro Dec 5, 2024
a6db55e
snake_case also for config getter
albestro Dec 6, 2024
d4cd568
do not expose internal header
albestro Dec 9, 2024
f3c0ba0
Update include/dlaf/tune.h
albestro Dec 9, 2024
89d72f7
remove superfluous spack
albestro Dec 11, 2024
f1719fa
add missing cli option for dlaf:tfactor-nworkers
albestro Dec 11, 2024
87c952b
bug fix: tfactor workspaces allocation
albestro Dec 11, 2024
3d3b93c
bug fix: replace sender workspace instead of appending
albestro Dec 11, 2024
92aa5b7
bug fix: missing default initialization to a proper value
albestro Dec 13, 2024
eb9def0
bug fix: diag has to be reset at the end
albestro Dec 16, 2024
202f1cc
bug fix: not all workspaces should be used
albestro Dec 16, 2024
c0a7660
bug fix: race-condition on k using tile_t
albestro Dec 16, 2024
dec050a
avoid moves by using const& in loop_gemv
albestro Dec 17, 2024
fc12aae
minor changes
albestro Dec 17, 2024
3f2a53a
minor change: drop unused step_trmv and reorder gpu helpers
albestro Dec 17, 2024
5997e52
minor changes and cleanup to doc
albestro Dec 17, 2024
383dde5
bug fix: worker might not get used anyway
albestro Dec 17, 2024
5f2ddd4
bug fix: workspace for reduction to band was too big.
albestro Dec 19, 2024
ffd3135
bug fix: workspace in bt_red2band might not have been reset
albestro Dec 19, 2024
d199b88
fix conflict
albestro Dec 20, 2024
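
Taken together, the commits externalize the T-factor workspace: callers now allocate a round-robin pool of workspace panels, sized from the new dlaf:tfactor-nworkers tuning parameter, and pass the selected tiles into computeTFactor. A simplified before/after of the public entry point (argument types abbreviated here; the exact declarations are in the include/dlaf/factorization/qr.h diff below):

// Before: no workspace argument; scratch tiles were managed inside the algorithm.
void computeTFactor(Panel& hh_panel, ReadOnlyTileSender taus, ReadWriteTileSender t);

// After: the caller supplies one workspace tile per extra worker
// (see get_tfactor_nworkers), typically selected out of a round-robin panel pool.
void computeTFactor(Panel& hh_panel, ReadOnlyTileSender taus, ReadWriteTileSender t,
                    std::vector<ReadWriteTileSender> workspaces);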
34 changes: 30 additions & 4 deletions include/dlaf/eigensolver/bt_reduction_to_band/impl.h
@@ -30,6 +30,7 @@
#include <dlaf/communication/kernels.h>
#include <dlaf/eigensolver/bt_reduction_to_band/api.h>
#include <dlaf/factorization/qr.h>
#include <dlaf/factorization/qr/internal/get_tfactor_nworkers.h>
#include <dlaf/matrix/copy.h>
#include <dlaf/matrix/copy_tile.h>
#include <dlaf/matrix/distribution.h>
@@ -134,6 +135,7 @@ void BackTransformationReductionToBand<backend, device, T>::call(
const SizeType b, MatrixRef<T, device>& mat_c, Matrix<const T, device>& mat_v,
Matrix<const T, Device::CPU>& mat_taus) {
using namespace bt_red_band;
using dlaf::factorization::internal::computeTFactor;

auto hp = pika::execution::thread_priority::high;
auto np = pika::execution::thread_priority::normal;
@@ -162,6 +164,14 @@ void BackTransformationReductionToBand<backend, device, T>::call(
dlaf::matrix::Distribution dist_t({mb, total_nr_reflector}, {mb, mb});
matrix::Panel<Coord::Row, T, device> panelT(dist_t);

const auto dist_ws = [&]() {
using dlaf::factorization::internal::get_tfactor_nworkers;
const SizeType nworkspaces = to_SizeType(std::max<std::size_t>(0, get_tfactor_nworkers() - 1));
const SizeType nrefls_step = dist_v.tile_size().cols();
return matrix::Distribution{{nworkspaces * nrefls_step, nrefls_step}, {nrefls_step, nrefls_step}};
}();
common::RoundRobin<matrix::Panel<Coord::Col, T, device>> panelsWS(n_workspaces, dist_ws);

const SizeType nr_reflector_blocks = dist_t.nrTiles().cols();

for (SizeType k = nr_reflector_blocks - 1; k >= 0; --k) {
@@ -174,6 +184,7 @@
const matrix::SubPanelView panel_view(dist_v, v_offset, nr_reflectors);
const matrix::SubMatrixView mat_c_view(dist_c, c_offset);

auto& panelWS = panelsWS.nextResource();
auto& panelV = panelsV.nextResource();
auto& panelW = panelsW.nextResource();
auto& panelW2 = panelsW2.nextResource();
@@ -182,6 +193,7 @@
panelW.setRangeStart(v_offset);

if (is_last) {
panelWS.setWidth(nr_reflectors);
panelT.setHeight(nr_reflectors);
panelW2.setHeight(nr_reflectors);
panelW.setWidth(nr_reflectors);
@@ -207,8 +219,10 @@

const LocalTileIndex taus_index{Coord::Row, k};
const LocalTileIndex t_index{Coord::Col, k};
dlaf::factorization::internal::computeTFactor<backend>(panelV, mat_taus.read(taus_index),
panelT.readwrite(t_index));

computeTFactor<backend>(panelV, mat_taus.read(taus_index), panelT.readwrite(t_index),
select(panelWS, panelWS.iteratorLocal()));
panelWS.reset();

// W = V T
auto tile_t = panelT.read(t_index);
@@ -242,6 +256,7 @@ void BackTransformationReductionToBand<B, D, T>::call(comm::CommunicatorGrid& gr
Matrix<const T, Device::CPU>& mat_taus) {
namespace ex = pika::execution::experimental;
using namespace bt_red_band;
using dlaf::factorization::internal::computeTFactor;

auto hp = pika::execution::thread_priority::high;
auto np = pika::execution::thread_priority::normal;
@@ -277,6 +292,14 @@ void BackTransformationReductionToBand<B, D, T>::call(comm::CommunicatorGrid& gr
dist_v.sourceRankIndex());
matrix::Panel<Coord::Row, T, D> panelT(dist_t);

const auto dist_ws = [&]() {
using dlaf::factorization::internal::get_tfactor_nworkers;
const SizeType nworkspaces = to_SizeType(std::max<std::size_t>(0, get_tfactor_nworkers() - 1));
const SizeType nrefls_step = dist_v.tile_size().cols();
return matrix::Distribution{{nworkspaces * nrefls_step, nrefls_step}, {nrefls_step, nrefls_step}};
}();
common::RoundRobin<matrix::Panel<Coord::Col, T, D>> panelsWS(n_workspaces, dist_ws);

const SizeType nr_reflector_blocks = dist_t.nrTiles().cols();

for (SizeType k = nr_reflector_blocks - 1; k >= 0; --k) {
@@ -289,6 +312,7 @@
const matrix::SubPanelView panel_view(dist_v, v_offset, nr_reflectors);
const matrix::SubMatrixView mat_c_view(dist_c, c_offset);

auto& panelWS = panelsWS.nextResource();
auto& panelV = panelsV.nextResource();
auto& panelW = panelsW.nextResource();
auto& panelW2 = panelsW2.nextResource();
@@ -298,6 +322,7 @@
panelT.setRange(GlobalTileIndex(Coord::Col, k), GlobalTileIndex(Coord::Col, k + 1));

if (is_last) {
panelWS.setWidth(nr_reflectors);
panelT.setHeight(nr_reflectors);
panelW2.setHeight(nr_reflectors);
panelW.setWidth(nr_reflectors);
@@ -329,8 +354,8 @@ void BackTransformationReductionToBand<B, D, T>::call(comm::CommunicatorGrid& gr
const GlobalTileIndex taus_index{Coord::Row, k};
const SizeType k_local = dist_t.template localTileFromGlobalTile<Coord::Col>(k);
const LocalTileIndex t_index{Coord::Col, k_local};
dlaf::factorization::internal::computeTFactor<B>(panelV, mat_taus.read(taus_index),
panelT.readwrite(t_index), mpi_col_task_chain);
computeTFactor<B>(panelV, mat_taus.read(taus_index), panelT.readwrite(t_index),
select(panelWS, panelWS.iteratorLocal()), mpi_col_task_chain);

// WH = V T
for (const auto& idx : panel_view.iteratorLocal()) {
@@ -362,6 +387,7 @@ void BackTransformationReductionToBand<B, D, T>::call(comm::CommunicatorGrid& gr
splitTile(mat_c.readwrite(ij), mat_c_view(ij)));
}

panelWS.reset();
panelV.reset();
panelW.reset();
panelW2.reset();
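
A note on the dist_ws lambdas above: they allocate one nrefls_step x nrefls_step workspace tile per worker beyond the first (get_tfactor_nworkers() - 1), consistent with the commit "bulk for gpu (adapt mc to use one less workspace)"; the remaining partial result can go straight into the t tile. A minimal, self-contained sketch of the sizing arithmetic (plain integers only; get_tfactor_nworkers here is a hypothetical stand-in for the real config getter):

#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for dlaf::factorization::internal::get_tfactor_nworkers().
std::size_t get_tfactor_nworkers() { return 4; }

int main() {
  const long nrefls_step = 256;  // tile width of the reflector panel (nb)
  // One workspace tile per worker beyond the first; clamp at zero for safety.
  const long nworkspaces =
      std::max<long>(0, static_cast<long>(get_tfactor_nworkers()) - 1);
  // Workspace panel: a column of nworkspaces tiles, each nrefls_step x nrefls_step,
  // i.e. a (nworkspaces * nrefls_step) x nrefls_step matrix with nb x nb tiles.
  const long ws_rows = nworkspaces * nrefls_step;
  const long ws_cols = nrefls_step;
  return (ws_rows >= 0 && ws_cols > 0) ? 0 : 1;
}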
32 changes: 29 additions & 3 deletions include/dlaf/eigensolver/reduction_to_band/impl.h
@@ -42,6 +42,7 @@
#include <dlaf/eigensolver/internal/get_red2band_panel_nworkers.h>
#include <dlaf/eigensolver/reduction_to_band/api.h>
#include <dlaf/factorization/qr.h>
#include <dlaf/factorization/qr/internal/get_tfactor_nworkers.h>
#include <dlaf/lapack/tile.h>
#include <dlaf/matrix/copy_tile.h>
#include <dlaf/matrix/distribution.h>
@@ -1018,6 +1019,14 @@ Matrix<T, Device::CPU> ReductionToBand<B, D, T>::call(Matrix<T, D>& mat_a, const
common::RoundRobin<Panel<Coord::Col, T, D>> panels_w(n_workspaces, dist);
common::RoundRobin<Panel<Coord::Col, T, D>> panels_x(n_workspaces, dist);

const auto dist_ws = [&]() {
using dlaf::factorization::internal::get_tfactor_nworkers;
const SizeType nworkspaces = to_SizeType(std::max<std::size_t>(0, get_tfactor_nworkers() - 1));
const SizeType nrefls_step = dist.tile_size().cols();
return matrix::Distribution{{nworkspaces * nrefls_step, nrefls_step}, {nrefls_step, nrefls_step}};
}();
common::RoundRobin<matrix::Panel<Coord::Col, T, D>> panels_ws(n_workspaces, dist_ws);

// Note:
// Here dist_a is given with full panel size instead of dist with just the part actually needed,
// because the GPU Helper internally exploits Panel data-structure. Indeed, the full size panel is
@@ -1043,10 +1052,13 @@
// reflectors (i.e. at the end when last reflector size is 1)
const matrix::SubPanelView panel_view(dist_a, ij_offset, band_size);

Panel<Coord::Col, T, D>& ws = panels_ws.nextResource();
Panel<Coord::Col, T, D>& v = panels_v.nextResource();
v.setRangeStart(ij_offset);
if (isPanelIncomplete)
if (isPanelIncomplete) {
ws.setWidth(nrefls_tile);
v.setWidth(nrefls_tile);
}

// PANEL
compute_panel_helper.call(mat_a, mat_taus_retiled, j_sub, panel_view);
@@ -1063,7 +1075,9 @@
// TODO probably the first one in any panel is ok?
Matrix<T, D> t({nrefls_tile, nrefls_tile}, dist.blockSize());

computeTFactor<B>(v, mat_taus_retiled.read(GlobalTileIndex(j_sub, 0)), t.readwrite(t_idx));
computeTFactor<B>(v, mat_taus_retiled.read(GlobalTileIndex(j_sub, 0)), t.readwrite(t_idx),
select(ws, ws.iteratorLocal()));
ws.reset();

// PREPARATION FOR TRAILING MATRIX UPDATE
const GlobalElementIndex at_offset(ij_offset + GlobalElementSize(0, band_size));
@@ -1205,6 +1219,14 @@ Matrix<T, Device::CPU> ReductionToBand<B, D, T>::call(comm::CommunicatorGrid& gr
common::RoundRobin<matrix::Panel<Coord::Row, T, D, matrix::StoreTransposed::Yes>> panels_xt(
n_workspaces, dist);

const auto dist_ws = [&]() {
using dlaf::factorization::internal::get_tfactor_nworkers;
const SizeType nworkspaces = to_SizeType(std::max<std::size_t>(0, get_tfactor_nworkers() - 1));
const SizeType nrefls_step = band_size;
return matrix::Distribution{{nworkspaces * nrefls_step, nrefls_step}, {nrefls_step, nrefls_step}};
}();
common::RoundRobin<matrix::Panel<Coord::Col, T, D>> panels_ws(n_workspaces, dist_ws);

red2band::ComputePanelHelper<B, D, T> compute_panel_helper(n_workspaces, dist);

ex::unique_any_sender<> trigger_panel{ex::just()};
@@ -1228,10 +1250,12 @@

auto& v = panels_v.nextResource();
auto& vt = panels_vt.nextResource();
auto& ws = panels_ws.nextResource();

v.setRangeStart(at_offset);
vt.setRangeStart(at_offset);

ws.setWidth(nrefls_tile);
v.setWidth(nrefls_tile);
vt.setHeight(nrefls_tile);

@@ -1253,8 +1277,10 @@
// deadlock due to tile shared between panel and trailing matrix
red2band::local::setupReflectorPanelV<B, D, T>(rank.row() == rank_v0.row(), panel_view,
nrefls_tile, v, mat_a, !is_full_band);

computeTFactor<B>(v, mat_taus_retiled.read(GlobalTileIndex(j_sub, 0)), t.readwrite(t_idx),
mpi_col_chain);
select(ws, ws.iteratorLocal()), mpi_col_chain);
ws.reset();
}

// PREPARATION FOR TRAILING MATRIX UPDATE
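
Both reduction-to-band variants drive the workspace through the same per-iteration lifecycle: nextResource() takes a panel from the pool, setWidth(nrefls_tile) shrinks it when the last step carries fewer reflectors, select(ws, ws.iteratorLocal()) hands the tiles to computeTFactor, and reset() restores the panel before its next reuse. The pool is a common::RoundRobin; a minimal illustration of that idea (a deliberately simplified sketch, not the real DLA-Future class):

#include <cstddef>
#include <vector>

// Simplified round-robin resource pool: resources are handed out cyclically,
// so a panel is only reused once the iteration that held it pool_.size()
// rounds earlier has released it.
template <class T>
class RoundRobin {
public:
  RoundRobin(std::size_t n, T prototype) : pool_(n, std::move(prototype)) {}
  T& nextResource() { return pool_[next_++ % pool_.size()]; }

private:
  std::vector<T> pool_;
  std::size_t next_ = 0;
};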
71 changes: 65 additions & 6 deletions include/dlaf/factorization/qr.h
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#pragma once

#include <utility>
#include <vector>

/// @file

@@ -33,26 +34,84 @@ namespace dlaf::factorization::internal {
/// H0 H1 H2 ... HK-1
/// Note: The first element of the HH reflectors is NOT implicitly assumed to be 1,
/// it has to be set correctly in the panel (0s as well).

///
/// It is similar to what xLARFT in LAPACK does.
/// Given @p k elementary reflectors stored in the column of @p hh_panel together with related tau values
/// in @p taus, in @p t will be formed the triangular factor for the H block of reflectors, such that
///
/// H = I - V . T . V*
///
/// where H = H1 . H2 . ... . Hk
///
/// in which Hi represents a single elementary reflector transformation.
///
/// A Storage-Efficient WY Representation for Products of Householder Transformations.
/// Schreiber, Robert & VanLoan, Charles. (1989)
/// SIAM Journal on Scientific and Statistical Computing. 10. 10.1137/0910005.
///
/// @pre taus contains a vector with k elements
/// @pre t contains a (k x k) tile
/// @param hh_panel where the elementary reflectors are stored
/// @param taus array of taus, associated with the related elementary reflector
/// @param t tile where the resulting T factor will be stored in its top-left sub-matrix of size
/// TileElementSize(k, k)
/// @param workspaces array of tiles used as workspace, with at least one tile per worker (see
/// get_tfactor_nworkers)
///
/// @pre reflectors in hh_panel are well formed (1s on the diagonal and 0s in the upper part)
/// @pre hh_panel.getWidth() <= t.get().size().rows() && hh_panel.getWidth() <= t.get().size().cols()
template <Backend backend, Device device, class T>
void computeTFactor(matrix::Panel<Coord::Col, T, device>& hh_panel,
matrix::ReadOnlyTileSender<T, Device::CPU> taus,
matrix::ReadWriteTileSender<T, device> t) {
QR_Tfactor<backend, device, T>::call(hh_panel, std::move(taus), std::move(t));
matrix::ReadWriteTileSender<T, device> t,
std::vector<matrix::ReadWriteTileSender<T, device>> workspaces) {
QR_Tfactor<backend, device, T>::call(hh_panel, std::move(taus), std::move(t), std::move(workspaces));
}

/// Forms the triangular factor T of a block of reflectors H, which is defined as a product
/// of k := hh_panel.getWidth() elementary reflectors.
///
/// hh_panel should have the following form
/// H0 0 0 ... 0
/// . H1 0 ... 0
/// . . H2 ... 0
/// . . . ... 0
/// . . . ... HK-1
/// . . . ... .
/// H0 H1 H2 ... HK-1
/// Note: The first element of the HH reflectors is NOT implicitly assumed to be 1,
/// it has to be set correctly in the panel (0s as well).
///
/// It is similar to what xLARFT in LAPACK does.
/// Given @p k elementary reflectors stored in the column of @p hh_panel together with related tau values
/// in @p taus, in @p t will be formed the triangular factor for the H block of reflectors, such that
///
/// H = I - V . T . V*
///
/// where H = H1 . H2 . ... . Hk
///
/// in which Hi represents a single elementary reflector transformation.
///
/// A Storage-Efficient WY Representation for Products of Householder Transformations.
/// Schreiber, Robert & VanLoan, Charles. (1989)
/// SIAM Journal on Scientific and Statistical Computing. 10. 10.1137/0910005.
///
/// @param hh_panel where the elementary reflectors are stored
/// @param taus array of taus, associated with the related elementary reflector
/// @param t tile where the resulting T factor will be stored in its top-left sub-matrix of size
/// TileElementSize(k, k)
/// @param workspaces array of tiles used as workspace, with at least one tile per worker (see
/// get_tfactor_nworkers)
/// @param mpi_col_task_chain where internal communications are issued
///
/// @pre reflectors in hh_panel are well formed (1s on the diagonal and 0s in the upper part)
/// @pre hh_panel.getWidth() <= t.get().size().rows() && hh_panel.getWidth() <= t.get().size().cols()
template <Backend backend, Device device, class T>
void computeTFactor(matrix::Panel<Coord::Col, T, device>& hh_panel,
matrix::ReadOnlyTileSender<T, Device::CPU> taus,
matrix::ReadWriteTileSender<T, device> t,
std::vector<matrix::ReadWriteTileSender<T, device>> workspaces,
comm::CommunicatorPipeline<comm::CommunicatorType::Col>& mpi_col_task_chain) {
QR_Tfactor<backend, device, T>::call(hh_panel, std::move(taus), std::move(t), mpi_col_task_chain);
QR_Tfactor<backend, device, T>::call(hh_panel, std::move(taus), std::move(t), std::move(workspaces),
mpi_col_task_chain);
}

}
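
For reference, the recurrence that both overloads implement is the one from the cited Schreiber and Van Loan paper (and from xLARFT): writing each elementary reflector as $H_i = I - \tau_i v_i v_i^*$, the upper-triangular factor $T \in \mathbb{C}^{k \times k}$ is built one column at a time,

\[
T_{ii} = \tau_i, \qquad
T_{1:i-1,\,i} = -\tau_i \, T_{1:i-1,\,1:i-1} \bigl( V_{:,\,1:i-1}^{*} \, v_i \bigr),
\qquad i = 1, \dots, k,
\]

so that $H_1 H_2 \cdots H_k = I - V T V^{*}$. The products $V_{:,\,1:i-1}^{*} v_i$ are the per-column GEMVs that this PR batches into bulk operations.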
57 changes: 5 additions & 52 deletions include/dlaf/factorization/qr/api.h
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
#pragma once

#include <complex>
#include <vector>

#include <pika/execution.hpp>

@@ -27,62 +28,14 @@ struct QR {};

template <Backend backend, Device device, class T>
struct QR_Tfactor {
/// Forms the triangular factor T of a block of reflectors H, which is defined as a product of k
/// elementary reflectors.
///
/// It is similar to what xLARFT in LAPACK does.
/// Given @p k elementary reflectors stored in the column of @p v starting at tile @p v_start,
/// together with related tau values in @p taus, in @p t will be formed the triangular factor for the H
/// block of reflector, such that
///
/// H = I - V . T . V*
///
/// where H = H1 . H2 . ... . Hk
///
/// in which Hi represents a single elementary reflector transformation
///
/// @param k the number of elementary reflectors to use (from the beginning of the tile)
/// @param v where the elementary reflectors are stored
/// @param v_start tile in @p v where the column of reflectors starts
/// @param taus tile of the taus vector, associated with the related elementary reflector
/// @param t tile where the resulting T factor will be stored in its top-left sub-matrix of size
/// TileElementSize(k, k)
///
/// @pre k <= t.get().size().rows && k <= t.get().size().cols()
/// @pre k >= 0
/// @pre v_start.isIn(v.nrTiles())
static void call(matrix::Panel<Coord::Col, T, device>& panel_view,
static void call(matrix::Panel<Coord::Col, T, device>& hh_panel,
matrix::ReadOnlyTileSender<T, Device::CPU> taus,
matrix::ReadWriteTileSender<T, device> t);

/// Forms the triangular factor T of a block of reflectors H, which is defined as a product of k
/// elementary reflectors.
///
/// It is similar to what xLARFT in LAPACK does.
/// Given @p k elementary reflectors stored in the column of @p v starting at tile @p v_start,
/// together with related tau values in @p taus, in @p t will be formed the triangular factor for the H
/// block of reflector, such that
///
/// H = I - V . T . V*
///
/// where H = H1 . H2 . ... . Hk
///
/// in which Hi represents a single elementary reflector transformation
///
/// @param k the number of elementary reflectors to use (from the beginning of the tile)
/// @param v where the elementary reflectors are stored
/// @param v_start tile in @p v where the column of reflectors starts
/// @param taus tile of the taus vector, associated with the related elementary reflector
/// @param t tile where the resulting T factor will be stored in its top-left sub-matrix of size
/// TileElementSize(k, k)
/// @param mpi_col_task_chain where internal communications are issued
///
/// @pre k <= t.get().size().rows && k <= t.get().size().cols()
/// @pre k >= 0
/// @pre v_start.isIn(v.nrTiles())
matrix::ReadWriteTileSender<T, device> t,
std::vector<matrix::ReadWriteTileSender<T, device>> workspaces);
static void call(matrix::Panel<Coord::Col, T, device>& hh_panel,
matrix::ReadOnlyTileSender<T, Device::CPU> taus,
matrix::ReadWriteTileSender<T, device> t,
std::vector<matrix::ReadWriteTileSender<T, device>> workspaces,
comm::CommunicatorPipeline<comm::CommunicatorType::Col>& mpi_col_task_chain);
};

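
To summarize the call-site pattern after this change (lifted from the hunks above and shown side by side; a sketch, not an additional API): the local and distributed overloads differ only in the trailing communicator pipeline, through which, per the documentation, internal communications are issued.

// Local variant: workspace tiles are passed explicitly and reset afterwards.
computeTFactor<backend>(panelV, mat_taus.read(taus_index), panelT.readwrite(t_index),
                        select(panelWS, panelWS.iteratorLocal()));
panelWS.reset();

// Distributed variant: identical, plus the column communicator pipeline.
computeTFactor<B>(panelV, mat_taus.read(taus_index), panelT.readwrite(t_index),
                  select(panelWS, panelWS.iteratorLocal()), mpi_col_task_chain);
panelWS.reset();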