/*
Copyright (C) 2024 Scott Grauer-Gray

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
*/

GPU and Optimized CPU Implementations of Belief Propagation for Stereo

This README describes an implementation of the CUDA belief propagation algorithm
originally presented in "GPU implementation of belief propagation using CUDA
for cloud tracking and reconstruction" which was published at the 2008 IAPR
Workshop on Pattern Recognition in Remote Sensing (PRRS 2008) and can be 
found at the following address: https://sgrauerg6.github.io/cudaBeliefProp.pdf.
Please cite this work if using this code for GPU acceleration as part of a
research paper or other work.

Note that there have been changes from the initial implementation presented in
the paper: additional optimizations, updates to work on the latest NVIDIA GPUs
with the latest CUDA toolkits, and the addition of a parallel CPU implementation.
Also, the motion estimation portion of the code was removed to narrow the focus
of this work.  The additional optimizations as well as the parallel CPU
implementation are described in
https://sgrauerg6.github.io/OptimizingGlobalStereoMatchingOnNVIDIAGPUsAndCPUs.pdf.

The original code is available at
https://sgrauerg6.github.io/beliefPropCodePost.zip.

Doxygen documentation for current code:
https://sgrauerg6.github.io/Documentation/cudaBeliefProp/html/index.html

This code is distributed under the GNU General Public License, so any derivative
work that is distributed must also include this license.

See "RUNTIME RESULTS" section at end of README for runtime
info across various processors including the NVIDIA A100 and H100 GPUs,
AMD Genoa and Genoa-X CPUs, Intel Emerald Rapids and Sapphire Rapids CPUs,
and Amazon Graviton, Microsoft Cobalt, and NVIDIA Grace ARM server CPUs.

Please email comments and bug reports to sgrauerg@gmail.com


INSTRUCTIONS - EVALUATE OPTIMIZED IMPLEMENTATION ON TARGET ARCHITECTURE VS
OTHER ARCHITECTURES:

This code runs and evaluates belief propagation for stereo vision on the CPU
and GPU using CUDA on Linux-based systems.

The parameters are defined in BpResultsEvaluation/BpEvaluationStereoSets.h and
are documented in that file and the paper.  
In order to compile/re-compile the program, navigate to the directory
beliefprop/LinuxMakeAndRunBPOptimizedCPU (for the optimized CPU
implementation on x86 processors), the directory
beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM (for the optimized CPU
implementation on ARM processors), or the directory
beliefprop/LinuxMakeAndRunBPCUDA (for the CUDA implementation) and run
"make clean" and "make".

A number of stereo sets, including the Tsukuba stereo set, are included in the
"beliefprop/BpStereoSets" folder.  In order to run and evaluate the
implementation across six of these stereo sets using default parameters,
perform the following steps:

1a.  (CUDA ONLY) If needed, set the PATH and LD_LIBRARY_PATH environment
variables to the paths needed to run CUDA programs (usually PATH is appended
with $CUDA_DIR/bin and LD_LIBRARY_PATH with $CUDA_DIR/lib64); see the example
commands at the end of this step.

-If using a CUDA device with compute capability less than 8.0, half-precision
processing needs to be switched from the default bfloat16 datatype to the half
datatype by removing the flag to use the bfloat16 datatype in the
"BFLOAT_DATA_TYPE_SETTING = -DUSE_BFLOAT16_FOR_HALF_PRECISION" line in
beliefprop/LinuxMakeAndRunBPCUDA/Makefile. Note that this makes it so that the
results in the test with conesFullSizeCropped at half precision will not be
"good" due to half-precision value(s) becoming infinity or NaN when processing
that stereo set.
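
Example environment setup, assuming CUDA is installed at /usr/local/cuda
(adjust $CUDA_DIR to the actual install location):
export CUDA_DIR=/usr/local/cuda
export PATH=$CUDA_DIR/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_DIR/lib64:$LD_LIBRARY_PATH
For a device with compute capability less than 8.0, one way to remove the
bfloat16 flag is to leave the setting empty in
beliefprop/LinuxMakeAndRunBPCUDA/Makefile:
BFLOAT_DATA_TYPE_SETTING =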

1b.  (CPU ONLY) Make any necessary or desired adjustments to run on target CPU:

-On an x86 CPU, AVX-512 vectorization is used by default.  If AVX-512 is not
supported on the target CPU but AVX-256 is supported, set vectorization
to AVX-256 by changing the line in
beliefprop/LinuxMakeAndRunBPOptimizedCPU/Makefile.h from
"CPU_VECTORIZATION_DEFINE = -DAVX_512_VECTORIZATION" to
"CPU_VECTORIZATION_DEFINE = -DAVX_256_VECTORIZATION".

-If on a multi-CPU system, the environment variables OMP_PLACES and
OMP_PROC_BIND may need to be adjusted to optimize the OpenMP thread
configuration. In some tests on multi-CPU systems, faster results were obtained
by setting OMP_PROC_BIND=true and OMP_PLACES="sockets".
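
For example, to switch to AVX-256 vectorization, the relevant line in
beliefprop/LinuxMakeAndRunBPOptimizedCPU/Makefile.h becomes:
CPU_VECTORIZATION_DEFINE = -DAVX_256_VECTORIZATION
and, on a multi-CPU system, OpenMP threads can be bound to sockets before a run:
export OMP_PROC_BIND=true
export OMP_PLACES="sockets"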

2.  By default, belief propagation is run on 6 stereo sets using the float,
double, and half datatypes, both with and without templated disparity counts
for each stereo set, and is run up to 15 times in each configuration for
accurate benchmarking, with median runtimes used in the evaluation. As a result,
it takes a while to run the full evaluation. For quicker results, the number of
stereo sets used, the datatypes used in the evaluation, the optimization
options, the templated disparity count options, and the number of runs for each
configuration can be limited by enabling defines in the common.mk file that is
included in the Makefile for each implementation.
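
As a minimal sketch of how this works (the variable and define names below are
illustrative placeholders, not the actual names; see common.mk for the real
options), a define is enabled in common.mk and then picked up by the Makefile
for each implementation:
# in common.mk (placeholder names for illustration only)
COMMON_DEFINES += -DLIMIT_EVAL_CONFIGURATIONS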

3.  If running the optimized CPU implementation on an x86 CPU, navigate to the
    folder beliefprop/LinuxMakeAndRunBPOptimizedCPU.
    If running the optimized CPU implementation on an ARM CPU, navigate to the
    folder beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM.
    If running the CUDA implementation, navigate to the folder
    beliefprop/LinuxMakeAndRunBPCUDA.

4.  Execute the commands "make clean" and "make" on the command line.
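
For example, to build the optimized x86 CPU implementation:
cd beliefprop/LinuxMakeAndRunBPOptimizedCPU
make clean
make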

5.  Run "./driverCPUBp {RUN_NAME}" or "./driverCudaBp {RUN_NAME}",
calling generated executable "driver{CPU/CUDA}Bp" with input parameter
{RUN_NAME} to run the implementation on multiple stereo sets and in multiple
configurations (run name cannot have any spaces). Implementation is run
multiple times on each run configuration to more accurately measure the
runtime. The single-thread CPU implementation code from
http://cs.brown.edu/people/pfelzens/bp/index.html is also run to compare the
output disparity maps/runtimes from the original single-thread implementation
with the optimized implementations. If no {RUN_NAME} parameter is given, run
name is set to "CurrentRun". Recommended that {RUN_NAME} includes CPU/GPU
architecture so can easily match run results with corresponding architecture.
-Example {RUN_NAME}: "AMDRome48Cores" which corresponds to runs on a 48-core
CPU with the AMD Rome architecture and is used as baseline when computing
runtime speedup.
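
Example run commands ({RUN_NAME} is an arbitrary label chosen for the run):
./driverCPUBp AMDRome48Cores
./driverCudaBp NVIDIARTX3090Ti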

6.  The locations of the output disparity maps are given in the console output.
Detailed input and results for each run, including the input stereo set and
parameters, detailed timings along with overall speedup data, and the evaluation
of the output disparity maps compared to ground truth, are written to files in
subfolders of the beliefprop/impResults folder, with the file names and paths
for each result file given in the console output. A comparison of runtimes and
speedups across runs with results in beliefprop/impResults is written to
beliefprop/impResults/EvaluationAcrossRuns.csv, with the runs ordered from
fastest to slowest; each run is labeled using {RUN_NAME} from (5).
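
Example result file locations for a run named "AMDRome48Cores" (the exact paths
are given in the console output):
beliefprop/impResults/RunResults/AMDRome48Cores_RunResults.csv
beliefprop/impResults/EvaluationAcrossRuns.csv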

RUNTIME RESULTS:

Relative speedup using the optimized implementation across GPUs/CPUs with the
float type, with a prior run on AMD Rome (48 cores) used as the baseline:
-The column marked "Median Optimized Runtime (including transfer time)" in
beliefprop/impResults/RunResults/{RUN_NAME}_RunResults.csv is used for the
runtime comparison to get the relative speedup for each run
-The displayed speedup is the average relative speedup compared to the baseline
across a number of runs with different input stereo sets and/or run
configurations
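-For each run configuration, the relative speedup is computed as (median
runtime on the AMD Rome baseline) / (median runtime on the evaluated processor),
so values above 1.00x are faster than the baseline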
NVIDIA H100 (GH200): 3.34x
NVIDIA H100 (SXM5): 2.99x
AMD Genoa-X (176 Cores across 2 CPUs): 2.31x
NVIDIA H100 (PCIe): 2.27x
AMD Genoa-X (88 Cores*): 2.12x
AMD Genoa (192 Cores across 2 CPUs): 2.07x
AMD Genoa (96 Cores*): 1.98x
Amazon Graviton4 (96 ARM cores): 1.97x
Microsoft Cobalt (96 ARM cores): 1.91x
NVIDIA A100: 1.89x
Intel Emerald Rapids (48 cores): 1.84x
NVIDIA RTX 3090 Ti: 1.65x
AMD Milan-X (120 Cores across 2 CPUs): 1.60x
Intel Sapphire Rapids (48 cores): 1.53x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.44x
AMD Milan-X (60 cores*): 1.37x
Amazon Graviton3 (64 ARM cores): 1.20x
Amazon Graviton3E (64 ARM cores): 1.20x
NVIDIA V100: 1.15x
AMD Rome (48 Cores): 1.00x
Intel Ice Lake (32 cores): 1.00x
AMD Milan (48 cores): 0.91x
Amazon Graviton2 (64 ARM cores): 0.90x
NVIDIA P100: 0.78x
Intel Cascade Lake (24 cores): 0.66x
Intel Tiger Lake laptop (8 Cores): 0.33x
AMD 3600 (6 Cores): 0.27x
Spreadsheet with detailed results:
https://docs.google.com/spreadsheets/d/1j_JWKx9K2puHnWB0qRaxP3TyEkMDaoNpiLhUHleev-w/edit?usp=sharing

*Run on a system with 2 socketed CPUs with a total of 2x the listed core count
across the CPUs and with SMT disabled. Benchmarked runs limit the maximum number
of threads to the thread count of a single CPU, with threads pinned to a socket,
so a single CPU should be used in the run.
Commands to pin OMP threads to socket:
export OMP_PLACES="sockets"
export OMP_PROC_BIND=true

Single-thread implementation speedups (relative single-core performance): 
Intel Emerald Rapids (48 cores): 1.48x
Intel Sapphire Rapids (48 cores): 1.41x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.36x
Intel Tiger Lake laptop (8 Cores): 1.31x
AMD Genoa-X (88 Cores): 1.24x
Microsoft Cobalt (96 ARM cores): 1.13x
AMD Genoa (96 Cores): 1.12x
Amazon Graviton4 (96 ARM cores): 1.10x
Intel Ice Lake (32 cores): 1.07x
AMD 3600 (6 Cores): 1.00x
AMD Milan (48 cores): 0.97x
AMD Milan-X (60 cores): 0.95x
Intel Cascade Lake (24 cores): 0.92x
Amazon Graviton3 (64 ARM cores): 0.86x
Amazon Graviton3E (64 ARM cores): 0.86x
AMD Rome (48 Cores): 0.78x
Amazon Graviton2 (64 ARM cores): 0.63x
