sgrauerg6/cudaBeliefProp
/*
Copyright (C) 2024 Scott Grauer-Gray

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/

GPU and optimized CPU Implementations of Belief Propagation for Stereo

This README describes an implementation of the CUDA belief propagation algorithm originally presented in "GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction", published at the 2008 IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS 2008) and available at https://sgrauerg6.github.io/cudaBeliefProp.pdf. Please cite this work if using this code for GPU acceleration as part of a research paper or other work.

Note that there have been changes from the initial implementation as presented in the paper: the code has been further optimized, updated to work on the latest NVIDIA GPUs with the latest CUDA toolkits, and extended with a parallel CPU implementation. The motion estimation portion of the code was removed to narrow the focus of this work. The additional optimizations as well as the parallel CPU implementation are described in https://sgrauerg6.github.io/OptimizingGlobalStereoMatchingOnNVIDIAGPUsAndCPUs.pdf. The original code is available at https://sgrauerg6.github.io/beliefPropCodePost.zip.

Doxygen documentation for the current code: https://sgrauerg6.github.io/Documentation/cudaBeliefProp/html/index.html

This code is distributed under the GNU General Public License, so any derivative work that is distributed must also contain this license.

See the "RUNTIME RESULTS" section at the end of this README for runtime info across various processors, including the NVIDIA A100 and H100 GPUs, AMD Genoa and Genoa-X CPUs, Intel Emerald Rapids and Sapphire Rapids CPUs, and Amazon Graviton, Microsoft Cobalt, and NVIDIA Grace ARM server CPUs.

Please email comments and bug reports to sgrauerg@gmail.com.

INSTRUCTIONS - EVALUATE OPTIMIZED IMPLEMENTATION ON TARGET ARCHITECTURE VS OTHER ARCHITECTURES:

This code runs and evaluates belief propagation for stereo vision on the CPU and on the GPU using CUDA on Linux-based systems. The parameters are defined in BpResultsEvaluation/BpEvaluationStereoSets.h and are documented in that file and in the paper.

In order to compile/re-compile the program, navigate to the directory beliefprop/LinuxMakeAndRunBPOptimizedCPU (for the optimized CPU implementation on x86 processors), beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM (for the optimized CPU implementation on ARM processors), or beliefprop/LinuxMakeAndRunBPCUDA (for the CUDA implementation) and run "make clean" and "make". A number of stereo sets, including the Tsukuba stereo set, are included in the "BeliefProp/BpStereoSets" folder. In order to run and evaluate the implementation across six of these stereo sets using default parameters, perform the following steps:
1a. (CUDA ONLY) If needed, set PATH and LD_LIBRARY_PATH to the paths needed to run CUDA programs (usually PATH is appended with $CUDA_DIR/bin and LD_LIBRARY_PATH with $CUDA_DIR/lib64).
    - If using a CUDA device with compute capability less than 8.0, switch half-precision processing from the default bfloat16 datatype to the half datatype by removing the flag to use the bfloat16 datatype in the "BFLOAT_DATA_TYPE_SETTING = -DUSE_BFLOAT16_FOR_HALF_PRECISION" line in beliefprop/LinuxMakeAndRunBPCUDA/Makefile. Note that this makes the results of the conesFullSizeCropped test at half precision not "good" due to half-precision value(s) becoming infinity or NaN when processing that stereo set.

1b. (CPU ONLY) Make any necessary or desired adjustments to run on the target CPU:
    - On x86 CPUs, AVX-512 vectorization is used by default. If AVX-512 is not supported on the target CPU but AVX-256 is supported, set vectorization to AVX-256 by changing the line in beliefprop/LinuxMakeAndRunBPOptimizedCPU/Makefile.h from "CPU_VECTORIZATION_DEFINE = -DAVX_512_VECTORIZATION" to "CPU_VECTORIZATION_DEFINE = -DAVX_256_VECTORIZATION".
    - If on a multi-CPU system, it may be worth adjusting the environment variables OMP_PLACES and OMP_PROC_BIND to optimize the OpenMP thread configuration. In some tests on multi-CPU systems, faster results were obtained by setting OMP_PROC_BIND=true and OMP_PLACES="sockets".

2. By default, belief propagation is run on 6 stereo sets using the float, double, and half datatypes, with and without templated disparity counts for each stereo set, and is run up to 15 times in each configuration for accurate benchmarking, with median runtimes used in the evaluation. As a result, the full evaluation takes a while to run. For quicker results, the number of stereo sets, the datatypes used in the evaluation, the optimization options, the templated disparity count options, and the number of runs for each configuration can be limited by enabling defines in the common.mk file that is included in the Makefile for each implementation.

3. If running the optimized CPU implementation on an x86 CPU, navigate to the folder beliefprop/LinuxMakeAndRunBPOptimizedCPU. If running the optimized CPU implementation on an ARM CPU, navigate to the folder beliefprop/LinuxMakeAndRunBPOptimizedCPU_ARM. If running the CUDA implementation, navigate to the folder beliefprop/LinuxMakeAndRunBPCUDA.

4. Execute the commands "make clean" and "make" on the command line.

5. Run "./driverCPUBp {RUN_NAME}" or "./driverCudaBp {RUN_NAME}", calling the generated executable "driver{CPU/CUDA}Bp" with the input parameter {RUN_NAME}, to run the implementation on multiple stereo sets and in multiple configurations (the run name cannot contain any spaces). The implementation is run multiple times for each run configuration to more accurately measure the runtime. The single-thread CPU implementation from http://cs.brown.edu/people/pfelzens/bp/index.html is also run to compare the output disparity maps and runtimes of the original single-thread implementation with the optimized implementations. If no {RUN_NAME} parameter is given, the run name is set to "CurrentRun". It is recommended that {RUN_NAME} include the CPU/GPU architecture so run results can easily be matched with the corresponding architecture. (A condensed example command sequence for steps 1a through 5 is sketched after step 6.)
    - Example {RUN_NAME}: "AMDRome48Cores", which corresponds to runs on a 48-core CPU with the AMD Rome architecture and is used as the baseline when computing runtime speedup.

6. The locations of the output disparity maps are given in the console output.
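The commands below are a minimal sketch of steps 1a through 5 for the CUDA implementation, assuming the local CUDA toolkit location is stored in a (hypothetical) CUDA_DIR environment variable; the run name is only an illustration, so adjust the paths and the run name for your system.

    # Make CUDA binaries/libraries visible (step 1a); CUDA_DIR is a placeholder for the local CUDA install path
    export PATH=$CUDA_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_DIR/lib64:$LD_LIBRARY_PATH
    # Build the CUDA implementation (steps 3 and 4)
    cd beliefprop/LinuxMakeAndRunBPCUDA
    make clean
    make
    # Run the evaluation with a run name that identifies the target GPU (step 5)
    ./driverCudaBp NVIDIARTX3090Ti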
Detailed input and results for each run, including the input stereo set and parameters, detailed timings along with overall speedup data, and the evaluation of the output disparity maps against ground truth, are written to files in subfolders of the beliefprop/impResults folder, with the file names and paths for each result file given in the console output. A comparison of runtimes and speedups across runs with results in beliefprop/impResults is written to beliefprop/impResults/EvaluationAcrossRuns.csv, with the runs ordered from fastest to slowest; each run is labeled using the {RUN_NAME} from step 5.

RUNTIME RESULTS:

Relative speedup using the optimized implementation across GPUs/CPUs using the float type, with a prior run on AMD Rome (48 cores) used as the baseline:
- The column marked "Median Optimized Runtime (including transfer time)" in beliefprop/impResults/RunResults/{RUN_NAME}_RunResults.csv is used for the runtime comparison to get the relative speedup for each run.
- The displayed speedup is the average relative speedup compared to the baseline across a number of runs with different input stereo sets and/or run configurations (see the worked example after the results below).

NVIDIA H100 (GH200): 3.34x
NVIDIA H100 (SXM5): 2.99x
AMD Genoa-X (176 Cores across 2 CPUs): 2.31x
NVIDIA H100 (PCIe): 2.27x
AMD Genoa-X (88 Cores*): 2.12x
AMD Genoa (192 Cores across 2 CPUs): 2.07x
AMD Genoa (96 Cores*): 1.98x
Amazon Graviton4 (96 ARM cores): 1.97x
Microsoft Cobalt (96 ARM cores): 1.91x
NVIDIA A100: 1.89x
Intel Emerald Rapids (48 cores): 1.84x
NVIDIA RTX 3090 Ti: 1.65x
AMD Milan-X (120 Cores across 2 CPUs): 1.60x
Intel Sapphire Rapids (48 cores): 1.53x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.44x
AMD Milan-X (60 cores*): 1.37x
Amazon Graviton3 (64 ARM cores): 1.20x
Amazon Graviton3E (64 ARM cores): 1.20x
NVIDIA V100: 1.15x
AMD Rome (48 Cores): 1.00x
Intel Ice Lake (32 cores): 1.00x
AMD Milan (48 cores): 0.91x
Amazon Graviton2 (64 ARM cores): 0.90x
NVIDIA P100: 0.78x
Intel Cascade Lake (24 cores): 0.66x
Intel Tiger Lake laptop (8 Cores): 0.33x
AMD 3600 (6 Cores): 0.27x

Spreadsheet with detailed results: https://docs.google.com/spreadsheets/d/1j_JWKx9K2puHnWB0qRaxP3TyEkMDaoNpiLhUHleev-w/edit?usp=sharing

*Run on a system with 2 socketed CPUs (a total of 2x the listed core count across the CPUs) and with SMT disabled. Benchmarked runs have the maximum number of threads limited to the thread count of a single CPU, with threads pinned to the socket, so a single CPU should be used in the run. Commands to pin OMP threads to the socket:
export OMP_PLACES="sockets"
export OMP_PROC_BIND=true

Single-thread implementation speedups (relative single-core performance):
Intel Emerald Rapids (48 cores): 1.48x
Intel Sapphire Rapids (48 cores): 1.41x
NVIDIA Grace CPU in GH200 (64 ARM cores): 1.36x
Intel Tiger Lake laptop (8 Cores): 1.31x
AMD Genoa-X (88 Cores): 1.24x
Microsoft Cobalt (96 ARM cores): 1.13x
AMD Genoa (96 Cores): 1.12x
Amazon Graviton4 (96 ARM cores): 1.10x
Intel Ice Lake (32 cores): 1.07x
AMD 3600 (6 Cores): 1.00x
AMD Milan (48 cores): 0.97x
AMD Milan-X (60 cores): 0.95x
Intel Cascade Lake (24 cores): 0.92x
Amazon Graviton3 (64 ARM cores): 0.86x
Amazon Graviton3E (64 ARM cores): 0.86x
AMD Rome (48 Cores): 0.78x
Amazon Graviton2 (64 ARM cores): 0.63x
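A worked illustration of the speedup computation described above (a sketch based on this README's description, not code taken from the evaluation framework): the relative speedup of a run for one configuration is the baseline median runtime divided by that run's median runtime, and the displayed value averages that ratio across the evaluated stereo sets and run configurations.

    # Illustrative only: relative speedup for a single configuration from two median runtimes
    # (placeholder numbers, not measured values); the reported speedups average this ratio
    # across multiple stereo sets and run configurations
    baseline_runtime=0.50   # median optimized runtime (seconds) from the AMD Rome baseline run
    target_runtime=0.25     # median optimized runtime (seconds) on the architecture being evaluated
    echo "scale=2; $baseline_runtime / $target_runtime" | bc   # prints 2.00, i.e. a 2.00x speedup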