add script to check the cpu env #6291

Merged
merged 4 commits into from
Dec 7, 2017

Conversation

@tensor-tang tensor-tang commented Dec 5, 2017

fix #5960

The result would be shown like this:

========================= Hardware Information =========================
CPU Name               : Intel(R) Xeon(R) Gold 6148M CPU @ 2.40GHz
CPU Family             : 6
Socket Number          : 2
Cores Per Socket       : 20
Total Physical Cores   : 40
Total Virtual Cores    : 40
Hyper Threading        : OFF
NUMA Nodes             : 2
-------------------------- Memory Information --------------------------
DIMMs max slots        : 24
Installed DIMM number  : 12
Memory Size            : 376G
Swap Memory Size       : 4.0G
Total Memory Size      : 380G
Max Memory Capacity    : 2304 GB
Configed Clock Speed   : 2666 MHz
-------------------------- Turbo Information  --------------------------
Scaling Driver         : intel_pstate
Turbo Status           : ON
CPU Max Frequency      : 3.70 GHz
CPU Min Frequency      : 1.00 GHz
CPU Freq Governor      : performance
========================= Software Information =========================
BIOS Release Date      : 03/10/2017
OS Version             : CentOS Linux release 7.3.1611 (Core)
Kernel Release Version : 3.10.0-514.el7.x86_64
Kernel Patch Version   : #1 SMP Tue Nov 22 16:42:41 UTC 2016
GCC Version            : 4.8.5 20150623 (Red Hat 4.8.5-11)
CMake Version          : 3.5.2
------------------ Environment Variables Information -------------------
KMP_AFFINITY           : unset
OMP_DYNAMIC            : unset
OMP_NESTED             : unset
OMP_NUM_THREADS        : unset
MKL_NUM_THREADS        : unset
MKL_DYNAMIC            : unset

@tensor-tang
Contributor Author

Added DIMM locator info.
It can be used to check whether the memory is installed in a reasonable layout.

The result:

 ./check_env.sh
========================= Hardware Information =========================
CPU Name               : Intel(R) Xeon(R) Gold 6148M CPU @ 2.40GHz
CPU Family             : 6
Socket Number          : 2
Cores Per Socket       : 20
Total Physical Cores   : 40
Total Virtual Cores    : 40
Hyper Threading        : OFF
NUMA Nodes             : 2
-------------------------- Memory Information --------------------------
Installed DIMM number  : 12
Installed DIMMs Locator:
 CPU1_DIMM_A1
 CPU1_DIMM_B1
 CPU1_DIMM_C1
 CPU1_DIMM_D1
 CPU1_DIMM_E1
 CPU1_DIMM_F1
 CPU2_DIMM_A1
 CPU2_DIMM_B1
 CPU2_DIMM_C1
 CPU2_DIMM_D1
 CPU2_DIMM_E1
 CPU2_DIMM_F1
Not installed DIMMs    :
 CPU1_DIMM_A2
 CPU1_DIMM_B2
 CPU1_DIMM_C2
 CPU1_DIMM_D2
 CPU1_DIMM_E2
 CPU1_DIMM_F2
 CPU2_DIMM_A2
 CPU2_DIMM_B2
 CPU2_DIMM_C2
 CPU2_DIMM_D2
 CPU2_DIMM_E2
 CPU2_DIMM_F2
DIMMs max slots        : 24
Memory Size            : 376G
Swap Memory Size       : 4.0G
Total Memory Size      : 380G
Max Memory Capacity    : 2304 GB
Configed Clock Speed   : 2666 MHz
-------------------------- Turbo Information  --------------------------
Scaling Driver         : intel_pstate
Turbo Status           : ON
CPU Max Frequency      : 3.70 GHz
CPU Min Frequency      : 1.00 GHz
CPU Freq Governor      : performance
========================= Software Information =========================
BIOS Release Date      : 03/10/2017
OS Version             : CentOS Linux release 7.3.1611 (Core)
Kernel Release Version : 3.10.0-514.el7.x86_64
Kernel Patch Version   : #1 SMP Tue Nov 22 16:42:41 UTC 2016
GCC Version            : 4.8.5 20150623 (Red Hat 4.8.5-11)
CMake Version          : 3.5.2
------------------ Environment Variables Information -------------------
KMP_AFFINITY           : unset
OMP_DYNAMIC            : unset
OMP_NESTED             : unset
OMP_NUM_THREADS        : unset
MKL_NUM_THREADS        : unset
MKL_DYNAMIC            : unset

@luotao1
Contributor

luotao1 commented Dec 7, 2017

I ran the script on two servers (one without AVX2 and one with AVX2). The results are below; it looks like some command formats are incompatible.

  • Without AVX2:
========================= Hardware Information =========================
CPU Name               : 
CPU Family             : 6
Socket Number          : 2
Cores Per Socket       : 6
Total Physical Cores   : 12
Total Virtual Cores    : 12
Hyper Threading        : OFF
NUMA Nodes             : 2
-------------------------- Memory Information --------------------------
/dev/mem: Permission denied
Installed DIMM number  : 0
/dev/mem: Permission denied
/dev/mem: Permission denied
Installed DIMMs Locator: 
Not installed DIMMs    : 
/dev/mem: Permission denied
DIMMs max slots        : 0
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Memory Size            : 
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Swap Memory Size       : 
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Total Memory Size      : 
/dev/mem: Permission denied
Max Memory Capacity    : 
/dev/mem: Permission denied
Configed Clock Speed   : 
/dev/mem: Permission denied
Warning: Have more than 1 speed type, all DIMMs should have same fequency: 
-------------------------- Turbo Information  --------------------------
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver: No such file or directory
Scaling Driver         : 
check_env.sh: line 88: [: ==: unary operator expected
Warning: Scaling driver is not intel_pstarte, maybe should enable it in BIOS
Turbo Status           : Unknown
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq: No such file or directory
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq: No such file or directory
Error: the max_frequency of all CPU should be equal
Error: the min_frequency of all CPU should be equal
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq: No such file or directory
awk: BEGIN{printf "%.2f",( / 1000000)}
awk:                        ^ unterminated regexp
awk: cmd. line:1: BEGIN{printf "%.2f",( / 1000000)}
awk: cmd. line:1:                                  ^ unexpected newline or end of string
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq: No such file or directory
awk: BEGIN{printf "%.2f",( / 1000000)}
awk:                        ^ unterminated regexp
awk: cmd. line:1: BEGIN{printf "%.2f",( / 1000000)}
awk: cmd. line:1:                                  ^ unexpected newline or end of string
CPU Max Frequency      :  GHz
CPU Min Frequency      :  GHz
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: No such file or directory
Error: the governor of all CPU should be the same
cat: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: No such file or directory
CPU Freq Governor      : 
========================= Software Information =========================
/dev/mem: Permission denied
BIOS Release Date      : 
OS Version             : CentOS release 6.3 (Final)
Kernel Release Version : 3.10.0_2-0-0-0
Kernel Patch Version   : #1 SMP Tue Sep 20 17:18:54 CST 2016
GCC Version            : 4.8.3
CMake Version          : 3.2.2
------------------ Environment Variables Information -------------------
KMP_AFFINITY           : unset
OMP_DYNAMIC            : unset
OMP_NESTED             : unset
OMP_NUM_THREADS        : unset
MKL_NUM_THREADS        : unset
MKL_DYNAMIC            : unset
Found MKLML            : /home/luotao02/.jumbo/lib/libmklml_intel.so
Found IOMP             : /home/luotao02/.jumbo/lib/libiomp5.so
find: `/home/hpc/soft/python27/lib/': No such file or directory
find: `/home/hpc/soft/python27/lib/': No such file or directory
find: `/home/hpc/soft/python27/lib/': No such file or directory
find: `/home/hpc/soft/maui/lib': No such file or directory
find: `/home/hpc/soft/maui/lib': No such file or directory
find: `/home/hpc/soft/maui/lib': No such file or directory
find: `/home/hpc/soft/openmpi/lib': No such file or directory
find: `/home/hpc/soft/openmpi/lib': No such file or directory
find: `/home/hpc/soft/openmpi/lib': No such file or directory
find: `/home/hpc/soft/torque/lib': No such file or directory
find: `/home/hpc/soft/torque/lib': No such file or directory
find: `/home/hpc/soft/torque/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhce/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhce/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhce/lib': No such file or directory
find: `/home/hpc/soft/hadoop-client/java6/jre/lib/amd64/server': No such file or directory
find: `/home/hpc/soft/hadoop-client/java6/jre/lib/amd64/server': No such file or directory
find: `/home/hpc/soft/hadoop-client/java6/jre/lib/amd64/server': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhdfs': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhdfs': No such file or directory
find: `/home/hpc/soft/hadoop-client/hadoop/libhdfs': No such file or directory
find: `/usr/local/mpc-1.0/lib': No such file or directory
find: `/usr/local/mpc-1.0/lib': No such file or directory
find: `/usr/local/mpc-1.0/lib': No such file or directory
find: `/usr/local/gmp-5.1.1/lib': No such file or directory
find: `/usr/local/gmp-5.1.1/lib': No such file or directory
find: `/usr/local/gmp-5.1.1/lib': No such file or directory
find: `/usr/local/mpfr-3.1.2/lib': No such file or directory
find: `/usr/local/mpfr-3.1.2/lib': No such file or directory
find: `/usr/local/mpfr-3.1.2/lib': No such file or directory
find: `/usr/local/gcc-4.6.4/lib': No such file or directory
find: `/usr/local/gcc-4.6.4/lib': No such file or directory
find: `/usr/local/gcc-4.6.4/lib': No such file or directory
find: `/home/work/cuda-5.5/lib64': No such file or directory
find: `/home/work/cuda-5.5/lib64': No such file or directory
find: `/home/work/cuda-5.5/lib64': No such file or directory
find: `/home/work/cuda-5.5/lib': No such file or directory
find: `/home/work/cuda-5.5/lib': No such file or directory
find: `/home/work/cuda-5.5/lib': No such file or directory
find: `/home/work/cuda-6.5/lib6': No such file or directory
find: `/home/work/cuda-6.5/lib6': No such file or directory
find: `/home/work/cuda-6.5/lib6': No such file or directory
/dev/mem: Permission denied
  • With AVX2:
========================= Hardware Information =========================
CPU Name               : 
CPU Family             : 6
Socket Number          : 2
Cores Per Socket       : 20
Total Physical Cores   : 40
Total Virtual Cores    : 40
Hyper Threading        : OFF
NUMA Nodes             : 2
-------------------------- Memory Information --------------------------
Installed DIMM number  : 0
Error: The installed DIMMs number does ont match the mapped memory device: 12
Error: The installed DIMMs number does ont match configured clocks: 12
Installed DIMMs Locator: 
Not installed DIMMs    :  
 CPU0_A1 
 CPU0_D1 
 CPU1_A1 
 CPU1_D1
DIMMs max slots        : 16
Error: The max dimm slots do not match the max dimms: 4
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Memory Size            : 
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Swap Memory Size       : 
free: invalid option -- 'h'
usage: free [-b|-k|-m|-g] [-l] [-o] [-t] [-s delay] [-c count] [-V]
  -b,-k,-m,-g show output in bytes, KB, MB, or GB
  -l show detailed low and high memory statistics
  -o use old format (no -/+buffers/cache line)
  -t display total for RAM + swap
  -s update every [delay] seconds
  -c update [count] times
  -V display version information and exit
Total Memory Size      : 
Max Memory Capacity    : 2304 GB
Configed Clock Speed   : 2666 MHz
-------------------------- Turbo Information  --------------------------
Scaling Driver         : acpi-cpufreq
Warning: Scaling driver is not intel_pstarte, maybe should enable it in BIOS
Turbo Status           : Unknown
CPU Max Frequency      : 2.40 GHz
CPU Min Frequency      : 1.00 GHz
CPU Freq Governor      : performance
========================= Software Information =========================
BIOS Release Date      : 10/19/2017
OS Version             : CentOS release 6.3 (Final)
Kernel Release Version : 3.10.0_3-0-0-12
Kernel Patch Version   : #1 SMP Thu Nov 2 14:22:44 CST 2017
GCC Version            : 4.4.6 20120305 (Red Hat 4.4.6-4)
check_env.sh: line 129: cmake: command not found
CMake Version          :
------------------ Environment Variables Information -------------------
KMP_AFFINITY           : unset
OMP_DYNAMIC            : unset
OMP_NESTED             : unset
OMP_NUM_THREADS        : unset
MKL_NUM_THREADS        : unset
MKL_DYNAMIC            : unset
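
The `[: ==: unary operator expected` and `awk: ... unterminated regexp` messages in the first log are the usual symptoms of an empty variable being expanded unquoted into a test, or spliced into an awk program. A minimal sketch of a guard follows; the helper names `driver_status` and `khz_to_ghz` are hypothetical and are not taken from check_env.sh.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from check_env.sh.

# Quote the variable so an empty value still yields a valid test expression.
driver_status() {
  if [ -z "$1" ]; then
    echo "Unknown"
  elif [ "$1" = "intel_pstate" ]; then
    echo "OK"
  else
    echo "Warning: scaling driver is $1, not intel_pstate"
  fi
}

# Pass the value with -v instead of splicing it into the awk program text,
# and skip the conversion entirely when the sysfs file could not be read.
khz_to_ghz() {
  if [ -z "$1" ]; then
    echo "Unknown"
  else
    LC_ALL=C awk -v khz="$1" 'BEGIN { printf "%.2f GHz\n", khz / 1000000 }'
  fi
}

driver_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
max_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
# "|| true" keeps a missing file from being treated as a fatal error.
driver=$(cat "$driver_file" 2>/dev/null || true)
max_khz=$(cat "$max_file" 2>/dev/null || true)
echo "Scaling Driver         : $(driver_status "$driver")"
echo "CPU Max Frequency      : $(khz_to_ghz "$max_khz")"
```

On a machine without the cpufreq sysfs files, both lines would print `Unknown` instead of shell and awk errors.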


if [ "`uname -s`" != "Linux" ]; then
echo "Current scenario only support in Linux yet!"
exit 0
Contributor

Is it also unsupported in a Docker environment on Mac?

Contributor Author

I tried it. The main problem is that the dmidecode command is missing, so it is not supported; it works once dmidecode is added to the Docker image.

Contributor

Could you open a new issue to add the dmidecode command to the Docker environment?
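
Until dmidecode is available everywhere, the script could detect its absence and skip the DMI-based checks rather than fail. The sketch below assumes two hypothetical helpers, `have_cmd` and `dmi_check_status`, and assumes that dmidecode generally needs root to read the DMI tables, which matches the `/dev/mem: Permission denied` lines in the log above.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the DMI-based checks can run; prints a reason when they cannot.
dmi_check_status() {
  if ! have_cmd dmidecode; then
    echo "skip: dmidecode not installed"
  elif [ "$(id -u)" -ne 0 ]; then
    echo "skip: dmidecode usually needs root"
  else
    echo "ok"
  fi
}

echo "DMI checks             : $(dmi_check_status)"
```

The memory and BIOS sections would then run only when the status is `ok`.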

num_dimms_installed=0
for dimm_id in `dmidecode |grep Locator|sort -u | awk -F ':' '{print $2}'`; do
num_refered=`dmidecode |grep -c "$dimm_id"`
# the acutal dimm id should be refered only once
Contributor

Typo: "acutal" should be "actual".


# dump all details for fully check
lscpu > lscpu.dump
dmidecode > dmidecode.dump
Contributor

Could you list the correct output at the end of the script?
Including #6291 (comment) and #6291 (comment).

Contributor Author

There is no single fully correct output, because the expected results may differ across platforms.

Contributor

It does not need to be the correct output; just give an example of the output.

physical_cores=$((sockets * cores_per_socket))
virtual_cores=`grep 'processor' /proc/cpuinfo | sort -u | wc -l`
numa_nodes=`lscpu |grep "NUMA node(s)"|awk -F':' '{print $2}'|xargs`
echo "CPU Name : `lscpu |grep \"name\" |awk -F':' '{print $2}'|xargs`"
Contributor

CPU Name: I tried two machines, and it is printed empty on both?

Contributor Author

@tensor-tang tensor-tang left a comment

Thanks. Could you upload the two dump files produced at the end of the script, from both the machine without AVX2 and the one with AVX2? I suspect this is related to the OS version.


@luotao1
Contributor

luotao1 commented Dec 7, 2017

dump.zip
It includes the two dump files from each of the two machines; the avx2 suffix marks the machine that supports AVX2.

@tensor-tang
Contributor Author

Thanks. Also:

free: invalid option -- 'h'

This message is probably because the installed version of free is too old; mine is free from procps-ng 3.3.10.
A check needs to be added for this.
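
One way to avoid depending on `free -h` at all is to read `/proc/meminfo` and do the unit conversion in the script. This is only a sketch under that assumption, not the check that was merged; `kb_to_gb` and `meminfo_kb` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: convert a kB count (as reported in /proc/meminfo) to GB.
kb_to_gb() {
  if [ -z "$1" ]; then
    echo "Unknown"
    return
  fi
  LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.1f GB\n", kb / 1024 / 1024 }'
}

# Hypothetical helper: print one field (e.g. MemTotal, SwapTotal) from
# /proc/meminfo in kB; prints nothing if the file is unavailable.
meminfo_kb() {
  [ -r /proc/meminfo ] || return 0
  awk -v key="$1:" '$1 == key { print $2 }' /proc/meminfo
}

mem_kb=$(meminfo_kb MemTotal)
swap_kb=$(meminfo_kb SwapTotal)
echo "Memory Size            : $(kb_to_gb "$mem_kb")"
echo "Swap Memory Size       : $(kb_to_gb "$swap_kb")"
```

Because the conversion happens in awk, the output format does not depend on which procps version is installed.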

@tensor-tang
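Contributor Author

/dev/mem: Permission denied

On the machine without AVX2, the user does not have sufficient system permissions, so the dump is empty as well.

The lscpu version may differ, which is why the CPU name is missing.

I can add a check to bypass these cases.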
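
A possible shape for the CPU-name bypass, sketched under the assumption that `/proc/cpuinfo` carries a `model name` line even when the installed lscpu does not print one. The helpers `value_of` and `cpu_name` are hypothetical and not part of check_env.sh.

```shell
#!/bin/bash
# Hypothetical helper: from "key : value" text on stdin, print the value of the
# first line whose lowercased text starts with the given key.
value_of() {
  awk -v key="$1" '
    !found && index(tolower($0), key) == 1 {
      sub(/^[^:]*:[ \t]*/, "")   # drop everything up to and including the colon
      print
      found = 1
    }'
}

# Try lscpu first, then fall back to /proc/cpuinfo, then to a placeholder.
cpu_name() {
  local name=""
  if command -v lscpu >/dev/null 2>&1; then
    name=$(lscpu 2>/dev/null | value_of "model name")
  fi
  if [ -z "$name" ] && [ -r /proc/cpuinfo ]; then
    name=$(value_of "model name" < /proc/cpuinfo)
  fi
  echo "${name:-Unknown}"
}

echo "CPU Name               : $(cpu_name)"
```

With this fallback the `CPU Name` line prints `Unknown` rather than an empty value when neither source has the field.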

@luotao1
Contributor

luotao1 commented Dec 7, 2017

The printed result is as follows. cmake is not installed, so the command is not found; does this case need a check?

check_env.sh: line 153: cmake: command not found
CMake Version          :
========================= Hardware Information =========================
CPU Name               : Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU Family             : 6
Socket Number          : 2
Cores Per Socket       : 20
Total Physical Cores   : 40
Total Virtual Cores    : 40
Hyper Threading        : OFF
NUMA Nodes             : 2
-------------------------- Memory Information --------------------------
Installed DIMM number  : 12
Installed DIMMs Locator:  
 CPU0_A0 
 CPU0_B0 
 CPU0_C0 
 CPU0_D0 
 CPU0_E0 
 CPU0_F0 
 CPU1_A0 
 CPU1_B0 
 CPU1_C0 
 CPU1_D0 
 CPU1_E0 
 CPU1_F0
Not installed DIMMs    :  
 CPU0_A1 
 CPU0_D1 
 CPU1_A1 
 CPU1_D1
DIMMs max slots        : 16
Memory Size            : 376.3 GB
Swap Memory Size       : 0.0 GB
Total Memory Size      : 376.3 GB
Max Memory Capacity    : 2304 GB
Configed Clock Speed   : 2666 MHz
-------------------------- Turbo Information  --------------------------
Scaling Driver         : acpi-cpufreq
Warning: Scaling driver is not intel_pstarte, maybe should enable it in BIOS
Turbo Status           : Unknown
CPU Max Frequency      : 2.40 GHz
CPU Min Frequency      : 1.00 GHz
CPU Freq Governor      : performance
========================= Software Information =========================
BIOS Release Date      : 10/19/2017
OS Version             : CentOS release 6.3 (Final)
Kernel Release Version : 3.10.0_3-0-0-12
Kernel Patch Version   : #1 SMP Thu Nov 2 14:22:44 CST 2017
GCC Version            : 4.4.6 20120305 (Red Hat 4.4.6-4)
check_env.sh: line 153: cmake: command not found
CMake Version          :
------------------ Environment Variables Information -------------------
KMP_AFFINITY           : unset
OMP_DYNAMIC            : unset
OMP_NESTED             : unset
OMP_NUM_THREADS        : unset
MKL_NUM_THREADS        : unset
MKL_DYNAMIC            : unset

@tensor-tang
Contributor Author

Thanks for catching this, it's done.
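
For reference, a check of this kind can be sketched as follows. This is not the code that was merged; `cmake_version` is a hypothetical helper, and its optional argument (the binary name) exists only to make the missing-tool branch easy to exercise.

```shell
#!/bin/bash
# Hypothetical helper: print the CMake version, or a placeholder when the
# binary is absent. $1 optionally overrides the binary name (default: cmake).
cmake_version() {
  local bin="${1:-cmake}"
  if command -v "$bin" >/dev/null 2>&1; then
    # The first line of `cmake --version` looks like "cmake version 3.5.2".
    "$bin" --version | awk 'NR == 1 { print $3 }'
  else
    echo "not installed"
  fi
}

echo "CMake Version          : $(cmake_version)"
```

On a machine without cmake this prints `CMake Version          : not installed` and no `command not found` error.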

Contributor

@luotao1 luotao1 left a comment

LGTM

@tensor-tang tensor-tang merged commit c096130 into PaddlePaddle:develop Dec 7, 2017
@tensor-tang tensor-tang deleted the check_env branch December 7, 2017 12:07