Learning and practice of high performance computing

Install

git clone

Application

pocket-ai -- A Portable Toolkit for deploying Edge AI and HPC.

https://github.com/cjmcv/pocket-ai

Practice

cux -- An experimental framework for performance analysis and optimization of CUDA kernel functions.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/cux

tag: cuda / simd / openmp.

mrpc -- Mini-RPC, based on asio.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/mrpc

tag: distributed computing.

DEPRECATED

hcs -- A heterogeneous computing system for multi-task scheduling optimization.

vky -- A Vulkan-based computing framework.

"hcs" and "vky" have been moved to pocket-ai and renamed graph and vk, respectively.

Learning

Distributed computing

mpi/mpi4py

alg_matrix_multiply ： gemm: C = A * B.
base_broadcast_scatter_gather ： Record the basic usage of Bcast, Scatter, Gather and Allgather.
base_group ： Group communication.
base_hello_world ： Environment Management Routines.
base_reduce_alltoall_scan ： Record the basic usage of Reduce, Allreduce, Alltoall, Scan and Exscan.
base_send_recv ： Record the basic usage of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv.
base_type_contiguous ： Send and receive custom types of data by using MPI_Type_contiguous.
base_type_struct ： Send and receive custom types of data by using MPI_Type_struct.
util_bandwidth_test ： Test bandwidth by point-to-point communications.
py_base_broadcast_scatter_gather ： Record the basic usage of Bcast, Scatter, Gather and Allgather.
py_base_reduce_scan ： Record the basic usage of Reduce and Scan.
py_base_send_recv ： Record the basic usage of Send and Recv.

Heterogeneous computing

cuda

base_graph ： Record the basic usage of cuda graph.
base_unified_memory ： A simple task consumer using threads and streams with all data in Unified Memory.
base_zero_copy ： Record the basic usage of Zero Copy.
gemm_fp16_wmma ： GEMM in fp16 using the WMMA API (Tensor Cores).
gemm_fp32 ： GEMM in fp32 using CUDA cores.

vulkan

vky

opencl

ocl_util ： Utility functions.
alg_dot_product ： Vector dot product, h_result = SUM(A * B).
alg_vector_add ： Vector addition: C = A + B.
base_platform_info ： Query OpenCL platform information.

Thread

std

alg_quick_sort： Quick sort using std::thread.
alg_vector_dot_product： Vector dot product: h_result = SUM(A * B). Record the basic usage of std::thread and std::async.
base_async： Record the basic usage of std::async.
util_blocking_queue： Blocking queue. Mainly implemented by thread, queue and condition_variable.
util_internal_thread： Internal Thread. Mainly implemented by std::thread.
util_thread_pool： Thread Pool. Mainly implemented by thread, queue, future and condition_variable.
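
The blocking queue listed above is described as built from a thread, a queue and a condition_variable. The following is a minimal sketch of that pattern, written in Python rather than taken from the repository's C++ code; `threading.Condition` stands in for the mutex plus `std::condition_variable` pair, and the names `BlockingQueue`, `push`, `pop` and `run_demo` are hypothetical.

```python
import threading
from collections import deque


class BlockingQueue:
    """Hypothetical unbounded blocking queue: pop() waits until an item exists."""

    def __init__(self):
        self._items = deque()
        # A Condition bundles a lock with wait/notify, like mutex + condition_variable.
        self._cond = threading.Condition()

    def push(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def pop(self):
        with self._cond:
            # Re-check the predicate in a loop to guard against spurious wakeups.
            while not self._items:
                self._cond.wait()
            return self._items.popleft()


def run_demo(n_items=5):
    """A producer pushes 0..n_items-1; one consumer thread pops them in order."""
    q = BlockingQueue()
    received = []

    def consumer():
        for _ in range(n_items):
            received.append(q.pop())

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(n_items):
        q.push(i)
    t.join()
    return received
```

With a single consumer, `run_demo()` returns the items in the order they were pushed. A thread pool can be layered on the same idea by having several worker threads pop callables from one such queue.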

openmp

alg_matrix_multiply ： gemm: C = A * B.
alg_pi_calculate ： Calculate PI using parallel, for and reduction.
base_flush ： Record the basic usage of flush.
base_mutex ： Mutual exclusion in OpenMP, including critical, atomic and lock.
base_parallel_for ： Parallel and For.
base_schedule ： Record the basic usage of schedule.
base_sections_single ： Record the basic usage of Sections and Single.
base_synchronous ： Synchronization in OpenMP, including barrier, ordered and master.
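
The PI example above combines a parallel loop with a reduction. As a rough illustration of that decomposition only (not the repository's OpenMP code), the sketch below integrates 4 / (1 + x^2) over [0, 1] with the midpoint rule, splits the loop range into chunks, and sums the per-chunk results. The function names and the use of `ThreadPoolExecutor` are assumptions made for this sketch; in CPython the threads will not give the speedup that `#pragma omp parallel for reduction(+:sum)` would.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(start, stop, step):
    """Midpoint-rule sum of 4 / (1 + x^2) for sample indices in [start, stop)."""
    total = 0.0
    for i in range(start, stop):
        x = (i + 0.5) * step
        total += 4.0 / (1.0 + x * x)
    return total


def estimate_pi(num_steps=100_000, num_workers=4):
    """Split the loop into chunks (the 'parallel for') and add the partial sums (the 'reduction')."""
    step = 1.0 / num_steps
    chunk = (num_steps + num_workers - 1) // num_workers
    ranges = [(s, min(s + chunk, num_steps)) for s in range(0, num_steps, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(lambda r: partial_sum(r[0], r[1], step), ranges)
        return step * sum(partials)
```

Each worker keeps a private accumulator and the partial results are combined once at the end, which mirrors what an OpenMP reduction clause does and avoids locking inside the loop.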

tbb

base_allocator ： The basic use of allocator.
base_atomic ： The basic use of atomic.
base_concurrent_hash_map ： The basic use of concurrent_hash_map.
base_concurrent_queue ： The basic use of concurrent queue.
base_mutex ： The basic use of mutex in tbb.
base_parallel_for ： The basic use of parallel_for.
base_parallel_reduce ： The basic use of parallel_reduce.
base_parallel_scan ： The basic use of parallel_scan.
base_parallel_sort ： The basic use of parallel_sort.
base_task_scheduler ： The basic use of the task scheduler.
count_strings ： Count strings using concurrent_hash_map.

Coroutines

libco

asyncio

base_future： Record the basic usage of future.
base_gather： Use gather to execute tasks concurrently.
base_hello_world： Hello world. Record the basic usage of async, await and loop.
base_loop_chain： Executes nested coroutines.
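
The asyncio entries above cover futures, gather and nested coroutines. The following short sketch touches each of them using only the standard library; the coroutine names (`square`, `sum_of_squares`, `main`) and the helper `run_demo` are hypothetical and are not taken from the repository's examples.

```python
import asyncio


async def square(x, delay):
    """Hypothetical task: sleep without blocking the loop, then return x squared."""
    await asyncio.sleep(delay)
    return x * x


async def sum_of_squares(values):
    """Nested coroutine: awaits gather(), which runs the inner coroutines concurrently."""
    results = await asyncio.gather(*(square(v, 0.01) for v in values))
    return sum(results)


async def main():
    # gather() returns results in argument order, regardless of which finishes first.
    squares = await asyncio.gather(square(1, 0.03), square(2, 0.02), square(3, 0.01))

    # A bare Future: a scheduled callback sets the result, the awaiting coroutine receives it.
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    loop.call_soon(fut.set_result, "done")
    status = await fut

    total = await sum_of_squares([1, 2, 3])
    return squares, status, total


def run_demo():
    # asyncio.run() creates an event loop, runs main() to completion, and closes the loop.
    return asyncio.run(main())
```

All three `square` calls in `main` overlap their sleeps on a single thread, so the first gather takes roughly as long as the slowest task rather than the sum of the delays.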

SIMD

sse/avx

matrix_multiply ： Matrix Multiplication.
matrix_transpose ： Matrix Transpose.
vector_dot_product ： Vector dot product: result = SUM(A * B).
vector_scan ： Scan. Prefix Sum.

neon

matrix_multiply ： Matrix Multiplication.
matrix_transpose ： Matrix Transpose.
