Commit

Gpugraph.0621 (xuewujiao#145)
* Optimizing the zero key problem in the push phase

* Optimize CUDA thread parallelism in MergeGrad phase

* Optimize CUDA thread parallelism in MergeGrad phase

* Performance optimization, segment gradient merging

* Performance optimization, segment gradient merging

* Optimize pullsparse and increase keys aggregation

* sync gpugraph to gpugraph_v2 (xuewujiao#86)

* change load node and edge from local to cpu (xuewujiao#83)

* change load node and edge

* remove useless code

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* extract pull sparse as single stage(xuewujiao#85)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* [GPUGraph] graph sample v2 (xuewujiao#87)

* change load node and edge from local to cpu (xuewujiao#83)

* change load node and edge

* remove useless code

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* extract pull sparse as single stage(xuewujiao#85)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* support ssdsparsetable;test=develop (xuewujiao#81)

* graph sample v2

* remove log

Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>
Co-authored-by: danleifeng <52735331+danleifeng@users.noreply.github.com>

* Release cpu graph

* uniq nodeid (xuewujiao#89)

* compatible whole HBM mode (xuewujiao#91)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* Gpugraph v2 (xuewujiao#93)

* compatible whole HBM mode

* unify flag for graph emd storage mode and graph struct storage mode

* format

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* split generate batch into multi stage (xuewujiao#92)

* split generate batch into multi stage

* fix conflict

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* [GpuGraph] Uniq feature (xuewujiao#95)

* uniq feature

* uniq feature

* uniq feature

* [GpuGraph]  global startid (xuewujiao#98)

* uniq feature

* uniq feature

* uniq feature

* global startid

* load node edge separately and release graph (xuewujiao#99)

* load node edge separately and release graph

* load node edge separately and release graph

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* v2 infer (xuewujiao#102)

* optimize begin pass and end pass (xuewujiao#106)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* fix ins no (xuewujiao#104)

* [GPUGraph] fix FillOneStep args (xuewujiao#107)

* fix ins no

* fix FillOnestep args

* fix bug for whole hbm mode (xuewujiao#110)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* [GPUGraph] fix infer && add infer_table_cap (xuewujiao#108)

* fix ins no

* fix FillOnestep args

* fix infer && add infer table cap

* fix infer

* 【PSCORE】perform ssd sparse table  (xuewujiao#111)

* perform ssd sparsetable;test=develop

Conflicts:
	paddle/fluid/framework/fleet/ps_gpu_wrapper.cc

* perform ssd sparsetable;test=develop

* remove debug code;

* remove debug code;

* add jemalloc cmake;test=develop

* fix wrapper;test=develop

* fix sample core (xuewujiao#114)

* [GpuGraph] optimize shuffle batch (xuewujiao#115)

* fix sample core

* optimize shuffle batch

* release gpu mem when sample end (xuewujiao#116)

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* fix class not found err (xuewujiao#118)

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* optimize sample (xuewujiao#117)

* optimize sample

* optimize sample

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* fix clear gpu mem (xuewujiao#119)

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* fix sample core (xuewujiao#121)

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* add ssd cache (xuewujiao#123)

* add ssd cache;test=develop

* add ssd cache;test=develop

* add ssd cache;test=develop

* add multi epoch train & fix train table change ins & save infer embedding (xuewujiao#129)

* add multi epoch train & fix train table change ins & save infer embedding

* change epoch finish judge

* change epoch finish change

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>

* Add debug log (xuewujiao#131)

* Add debug log

* Add debug log

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>

* optimize mem in  uniq slot feature (xuewujiao#130)

* [GpuGraph] cherry pick var slot feature && fix load multi path node (xuewujiao#136)

* optimize mem in  uniq slot feature

* cherry-pick var slot_feature

Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>

* [GpuGraph] fix kernel overflow (xuewujiao#138)

* optimize mem in  uniq slot feature

* cherry-pick var slot_feature

* fix kernel overflow && add max feature num flag

Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>

* fix ssd cache;test=develop (xuewujiao#139)

* slot feature secondary storage (xuewujiao#140)

* slot feature secondary storage

* slot feature secondary storage

Co-authored-by: yangjunchao <yangjunchao@baidu.com>

Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>
Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com>
Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>
Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com>
Co-authored-by: danleifeng <52735331+danleifeng@users.noreply.github.com>
Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>
10 people authored Nov 8, 2022
1 parent 7b0c752 commit 85707ac
Showing 58 changed files with 3,902 additions and 1,886 deletions.
35 changes: 35 additions & 0 deletions cmake/external/jemalloc.cmake
@@ -0,0 +1,35 @@
+include(ExternalProject)
+
+set(JEMALLOC_PROJECT "extern_jemalloc")
+set(JEMALLOC_URL
+https://github.com/jemalloc/jemalloc/releases/download/5.1.0/jemalloc-5.1.0.tar.bz2
+)
+set(JEMALLOC_BUILD ${THIRD_PARTY_PATH}/jemalloc/src/extern_jemalloc)
+set(JEMALLOC_SOURCE_DIR "${THIRD_PARTY_PATH}/jemalloc")
+set(JEMALLOC_INSTALL ${THIRD_PARTY_PATH}/install/jemalloc)
+set(JEMALLOC_INCLUDE_DIR ${JEMALLOC_INSTALL}/include)
+set(JEMALLOC_DOWNLOAD_DIR "${JEMALLOC_SOURCE_DIR}/src/${JEMALLOC_PROJECT}")
+
+set(JEMALLOC_STATIC_LIBRARIES
+${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a)
+set(JEMALLOC_LIBRARIES
+${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a)
+
+ExternalProject_Add(
+extern_jemalloc
+PREFIX ${JEMALLOC_SOURCE_DIR}
+URL ${JEMALLOC_URL}
+INSTALL_DIR ${JEMALLOC_INSTALL}
+DOWNLOAD_DIR "${JEMALLOC_DOWNLOAD_DIR}"
+BUILD_COMMAND $(MAKE)
+BUILD_IN_SOURCE 1
+INSTALL_COMMAND $(MAKE) install
+CONFIGURE_COMMAND "${JEMALLOC_DOWNLOAD_DIR}/configure"
+--prefix=${JEMALLOC_INSTALL} --disable-initial-exec-tls)
+
+add_library(jemalloc STATIC IMPORTED GLOBAL)
+set_property(TARGET jemalloc PROPERTY IMPORTED_LOCATION
+${JEMALLOC_STATIC_LIBRARIES})
+
+include_directories(${JEMALLOC_INCLUDE_DIR})
+add_dependencies(jemalloc extern_jemalloc)
34 changes: 30 additions & 4 deletions cmake/external/rocksdb.cmake
@@ -14,6 +14,13 @@

include(ExternalProject)

+# find_package(jemalloc REQUIRED)
+
+set(JEMALLOC_INCLUDE_DIR ${THIRD_PARTY_PATH}/install/jemalloc/include)
+set(JEMALLOC_LIBRARIES
+${THIRD_PARTY_PATH}/install/jemalloc/lib/libjemalloc_pic.a)
+message(STATUS "rocksdb jemalloc:" ${JEMALLOC_LIBRARIES})
+
set(ROCKSDB_PREFIX_DIR ${THIRD_PARTY_PATH}/rocksdb)
set(ROCKSDB_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocksdb)
set(ROCKSDB_INCLUDE_DIR
@@ -22,22 +29,41 @@ set(ROCKSDB_INCLUDE_DIR
set(ROCKSDB_LIBRARIES
"${ROCKSDB_INSTALL_DIR}/lib/librocksdb.a"
CACHE FILEPATH "rocksdb library." FORCE)
-set(ROCKSDB_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")
+set(ROCKSDB_COMMON_FLAGS
+"-g -pipe -O2 -W -Wall -Wno-unused-parameter -fPIC -fno-builtin-memcmp -fno-omit-frame-pointer"
+)
+set(ROCKSDB_FLAGS
+"-DNDEBUG -DROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX -DROCKSDB_FALLOCATE_PRESENT -DHAVE_SSE42 -DHAVE_PCLMUL -DZLIB -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_SUPPORT_THREAD_LOCAL -DROCKSDB_USE_RTTI -DROCKSDB_SCHED_GETCPU_PRESENT -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_AUXV_GETAUXVAL_PRESENT"
+)
+set(ROCKSDB_CMAKE_CXX_FLAGS
+"${ROCKSDB_COMMON_FLAGS} -DROCKSDB_LIBAIO_PRESENT -msse -msse4.2 -mpclmul ${ROCKSDB_FLAGS} -fPIC -I${JEMALLOC_INCLUDE_DIR}"
+)
+set(ROCKSDB_CMAKE_C_FLAGS
+"${ROCKSDB_COMMON_FLAGS} ${ROCKSDB_FLAGS} -DROCKSDB_LIBAIO_PRESENT -fPIC -I${JEMALLOC_INCLUDE_DIR}"
+)
include_directories(${ROCKSDB_INCLUDE_DIR})

+set(CMAKE_CXX_LINK_EXECUTABLE
+"${CMAKE_CXX_LINK_EXECUTABLE} -pthread -ldl -lrt -lz")
ExternalProject_Add(
extern_rocksdb
${EXTERNAL_PROJECT_LOG_ARGS}
PREFIX ${ROCKSDB_PREFIX_DIR}
-GIT_REPOSITORY "https://github.com/facebook/rocksdb"
-GIT_TAG v6.10.1
+GIT_REPOSITORY "https://github.com/Thunderbrook/rocksdb"
+GIT_TAG 6.19.fb
UPDATE_COMMAND ""
CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
-DWITH_BZ2=OFF
-DWITH_GFLAGS=OFF
+-DWITH_TESTS=OFF
+-DWITH_JEMALLOC=ON
+-DWITH_BENCHMARK_TOOLS=OFF
+-DJeMalloc_LIBRARIES=${JEMALLOC_LIBRARIES}
+-DJeMalloc_INCLUDE_DIRS=${JEMALLOC_INCLUDE_DIR}
-DCMAKE_CXX_FLAGS=${ROCKSDB_CMAKE_CXX_FLAGS}
--DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}
+-DCMAKE_C_FLAGS=${ROCKSDB_CMAKE_C_FLAGS}
+-DCMAKE_CXX_LINK_EXECUTABLE=${CMAKE_CXX_LINK_EXECUTABLE}
# BUILD_BYPRODUCTS ${ROCKSDB_PREFIX_DIR}/src/extern_rocksdb/librocksdb.a
INSTALL_COMMAND
mkdir -p ${ROCKSDB_INSTALL_DIR}/lib/ && cp
3 changes: 3 additions & 0 deletions cmake/third_party.cmake
@@ -422,6 +422,9 @@ if(WITH_PSCORE)

include(external/rocksdb) # download, build, install rocksdb
list(APPEND third_party_deps extern_rocksdb)
+
+include(external/jemalloc) # download, build, install jemalloc
+list(APPEND third_party_deps extern_jemalloc)
endif()

if(WITH_XBYAK)
15 changes: 13 additions & 2 deletions paddle/fluid/distributed/ps/service/ps_client.h
@@ -148,10 +148,12 @@ class PSClient {
return fut;
}

-virtual ::std::future<int32_t> PullSparsePtr(char **select_values,
+virtual ::std::future<int32_t> PullSparsePtr(int shard_id,
+char **select_values,
size_t table_id,
const uint64_t *keys,
-size_t num) {
+size_t num,
+uint16_t pass_id) {
VLOG(0) << "Did not implement";
std::promise<int32_t> promise;
std::future<int> fut = promise.get_future();
@@ -160,6 +162,15 @@ class PSClient {
}

virtual std::future<int32_t> PrintTableStat(uint32_t table_id) = 0;
+virtual std::future<int32_t> SaveCacheTable(uint32_t table_id,
+uint16_t pass_id,
+size_t threshold) {
+VLOG(0) << "Did not implement";
+std::promise<int32_t> promise;
+std::future<int> fut = promise.get_future();
+promise.set_value(-1);
+return fut;
+}

// Ensure that all accumulated (pending) requests are sent
virtual std::future<int32_t> Flush() = 0;
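
Reviewer sketch (not part of the commit): the default `PullSparsePtr` and `SaveCacheTable` bodies in this file both return a future that is already fulfilled with `-1`. The standalone example below isolates that pattern. `MakeReadyFuture` and `CallSucceeded` are hypothetical names introduced only for illustration, and treating a negative code as "not implemented or failed" is an assumption about how callers interpret the result.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <utility>

// Hypothetical helper, not in the commit: returns a future that already
// holds `code`, mirroring the promise/set_value/return sequence in the stubs.
inline std::future<int32_t> MakeReadyFuture(int32_t code) {
  std::promise<int32_t> promise;
  std::future<int32_t> fut = promise.get_future();
  promise.set_value(code);  // the shared state now stores the value
  return fut;               // stays valid after `promise` is destroyed
}

// Hypothetical caller-side check. Assumption: a negative code means the
// client did not implement the call or the call failed.
inline bool CallSucceeded(std::future<int32_t> fut) {
  assert(fut.valid());
  return fut.get() >= 0;  // does not block, the value is already set
}
```

Because the value is set before the future is returned, a caller that immediately calls `get()` never waits. A subclass that does real asynchronous work would instead fulfil the promise from its completion callback.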
30 changes: 28 additions & 2 deletions paddle/fluid/distributed/ps/service/ps_local_client.cc
@@ -260,10 +260,12 @@ ::std::future<int32_t> PsLocalClient::PushDense(const Region* regions,
// return done();
//}

-::std::future<int32_t> PsLocalClient::PullSparsePtr(char** select_values,
+::std::future<int32_t> PsLocalClient::PullSparsePtr(int shard_id,
+char** select_values,
size_t table_id,
const uint64_t* keys,
-size_t num) {
+size_t num,
+uint16_t pass_id) {
// FIXME
// auto timer =
// std::make_shared<CostTimer>("pslib_downpour_client_pull_sparse");
@@ -278,13 +280,37 @@ ::std::future<int32_t> PsLocalClient::PullSparsePtr(char** select_values,
table_context.pull_context.ptr_values = select_values;
table_context.use_ptr = true;
table_context.num = num;
+table_context.shard_id = shard_id;
+table_context.pass_id = pass_id;

// table_ptr->PullSparsePtr(select_values, keys, num);
table_ptr->Pull(table_context);

return done();
}

+::std::future<int32_t> PsLocalClient::PrintTableStat(uint32_t table_id) {
+auto* table_ptr = GetTable(table_id);
+std::pair<int64_t, int64_t> ret = table_ptr->PrintTableStat();
+VLOG(0) << "table id: " << table_id << ", feasign size: " << ret.first
+<< ", mf size: " << ret.second;
+return done();
+}
+
+::std::future<int32_t> PsLocalClient::SaveCacheTable(uint32_t table_id,
+uint16_t pass_id,
+size_t threshold) {
+auto* table_ptr = GetTable(table_id);
+std::pair<int64_t, int64_t> ret = table_ptr->PrintTableStat();
+VLOG(0) << "table id: " << table_id << ", feasign size: " << ret.first
+<< ", mf size: " << ret.second;
+if (ret.first > threshold) {
+VLOG(0) << "run cache table";
+table_ptr->CacheTable(pass_id);
+}
+return done();
+}
+
::std::future<int32_t> PsLocalClient::PushSparseRawGradient(
size_t table_id,
const uint64_t* keys,
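
Reviewer sketch (not part of the commit): `SaveCacheTable` above only calls `CacheTable(pass_id)` when the feasign count exceeds `threshold`. The check `ret.first > threshold` compares an `int64_t` with a `size_t`, so on common 64-bit platforms the signed operand is converted to unsigned before the comparison. The example below reproduces the gating logic with an explicit sign check. `FakeTable`, `ExceedsThreshold`, and `MaybeCacheTable` are hypothetical names, and whether the real `PrintTableStat` can ever report a negative count is an assumption, not something the diff shows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the table. Only the two calls used by
// SaveCacheTable are modeled, and their bodies are invented for this sketch.
struct FakeTable {
  int64_t feasign_size = 0;
  int64_t mf_size = 0;
  int cached_pass = -1;  // -1 means CacheTable was never called

  std::pair<int64_t, int64_t> PrintTableStat() const {
    return {feasign_size, mf_size};
  }
  void CacheTable(uint16_t pass_id) { cached_pass = pass_id; }
};

// True when a signed size exceeds an unsigned threshold. A negative size
// never exceeds it, which avoids the implicit signed-to-unsigned conversion.
inline bool ExceedsThreshold(int64_t size, size_t threshold) {
  return size >= 0 && static_cast<uint64_t>(size) > threshold;
}

// Mirrors the control flow of SaveCacheTable: read the stats, then cache
// only when the feasign count is above the threshold.
inline bool MaybeCacheTable(FakeTable* table, uint16_t pass_id,
                            size_t threshold) {
  assert(table != nullptr);
  const std::pair<int64_t, int64_t> stat = table->PrintTableStat();
  if (!ExceedsThreshold(stat.first, threshold)) return false;
  table->CacheTable(pass_id);
  return true;
}
```

The comparison is strict, as in the diff, so a table whose size equals the threshold is not cached.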
17 changes: 9 additions & 8 deletions paddle/fluid/distributed/ps/service/ps_local_client.h
@@ -76,18 +76,19 @@ class PsLocalClient : public PSClient {
return fut;
}

-virtual ::std::future<int32_t> PullSparsePtr(char** select_values,
+virtual ::std::future<int32_t> PullSparsePtr(int shard_id,
+char** select_values,
size_t table_id,
const uint64_t* keys,
-size_t num);
+size_t num,
+uint16_t pass_id);

-virtual ::std::future<int32_t> PrintTableStat(uint32_t table_id) {
-std::promise<int32_t> prom;
-std::future<int32_t> fut = prom.get_future();
-prom.set_value(0);
+virtual ::std::future<int32_t> PrintTableStat(uint32_t table_id);
+
+virtual ::std::future<int32_t> SaveCacheTable(uint32_t table_id,
+uint16_t pass_id,
+size_t threshold);

-return fut;
-}
virtual ::std::future<int32_t> PushSparse(size_t table_id,
const uint64_t* keys,
const float** update_values,
9 changes: 9 additions & 0 deletions paddle/fluid/distributed/ps/table/accessor.h
@@ -162,6 +162,15 @@ class ValueAccessor {
return 0;
}

+virtual bool SaveMemCache(float* value,
+int param,
+double global_cache_threshold,
+uint16_t pass_id) {
+return true;
+}
+
+virtual void UpdatePassId(float* value, uint16_t pass_id) {}
+
virtual float GetField(float* value, const std::string& name) { return 0.0; }
#define DEFINE_GET_INDEX(class, field) \
virtual int get_##field##_index() override { return class ::field##_index(); }
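
Reviewer sketch (not part of the commit): the two hooks added to `ValueAccessor` have permissive defaults, since `SaveMemCache` returns `true` and `UpdatePassId` does nothing. The example below shows one way a derived accessor might override them. Everything beyond the two signatures is an assumption made for illustration: the class names, the value layout (`value[0]` as pass id, `value[1]` as show count), and the caching policy are all invented and are not taken from any accessor in the repository.

```cpp
#include <cassert>
#include <cstdint>

// Minimal base that mirrors only the two hooks added in this hunk.
class ValueAccessorSketch {
 public:
  virtual ~ValueAccessorSketch() = default;
  virtual bool SaveMemCache(float* value, int param,
                            double global_cache_threshold, uint16_t pass_id) {
    return true;  // default from the diff: every value qualifies
  }
  virtual void UpdatePassId(float* value, uint16_t pass_id) {}
};

// Hypothetical accessor with an invented layout:
// value[0] = id of the last pass that touched the entry, value[1] = show count.
class ToyAccessor : public ValueAccessorSketch {
 public:
  static constexpr int kPassIdIndex = 0;
  static constexpr int kShowIndex = 1;

  // Invented policy: keep an entry in the memory cache only if it was
  // touched in the current pass and its show count is above the threshold.
  bool SaveMemCache(float* value, int /*param*/,
                    double global_cache_threshold, uint16_t pass_id) override {
    assert(value != nullptr);
    return value[kPassIdIndex] == static_cast<float>(pass_id) &&
           value[kShowIndex] > global_cache_threshold;
  }

  void UpdatePassId(float* value, uint16_t pass_id) override {
    assert(value != nullptr);
    // Every uint16_t is exactly representable as a float.
    value[kPassIdIndex] = static_cast<float>(pass_id);
  }
};
```

Keeping the defaults permissive means existing accessors compile and behave as before, while an accessor that wants pass-aware caching can opt in by overriding both hooks.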