Benchmarks
We evaluate performance with six widely-used models: VGG-16, GoogleNet (Inception-V1), ResNet-50, MobileNet V1, SqueezeNet and DenseNet-121. Our testbed has six different ARM devices, including two cellphones, two server-class processors and two embedded dev boards, whose specifications are listed in the following table. All results are averaged over 20 consecutive runs.
Table 1: Specifications
Device | Processor | #CPUs @ Clock Speed | CPU Arch. | Memory | OS | SoC Power |
---|---|---|---|---|---|---|
Samsung S8 | Snapdragon 835 | 4 @ 2.45GHz + 4 @ 1.90GHz | Kryo | 4GB | Android 7.0 | ~5W |
iPhone 7 | A10 Fusion | 2 @ 2.34GHz + 2 @ 1.05GHz | Hurricane | 2GB | iOS 11.1 | ~5W |
Huawei D05 | Hi1616 | 2 × 32 @ 2.40GHz | Cortex-A72 | 256GB | Ubuntu 16.04 | >100W |
Phytium FT1500A/16 | FTC660 | 16 @ 1.50GHz | Earth | 64GB | Kylin 5.0 | 35W |
RK3399 | RK3399 | 2 @ 1.80GHz + 4 @ 1.40GHz | Cortex-A72 + Cortex-A53 | 2GB | Debian | 6.05W |
Raspberry Pi 3 | Broadcom BCM2837 | 4 @ 1.20GHz | Cortex-A53 | 1GB | Ubuntu 16.04 | ~5W |
The VGG series of models consists largely of unit-stride 3x3 convolution layers (operators) and can therefore be effectively accelerated by the Winograd algorithm. We use VGG-16 as the reference model to evaluate the performance of our Winograd implementation.
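For background, Winograd F(2x2,3x3) computes each 2x2 output tile of a 3x3 convolution from a 4x4 input tile with 16 element-wise multiplies instead of the 36 of the direct method, a 2.25x reduction in multiplications. The following NumPy sketch illustrates the transform using the standard Lavin-Gray matrices; it is a minimal reference, not FeatherCNN's actual NEON implementation:

```python
import numpy as np

# Standard transform matrices for Winograd F(2x2,3x3) (Lavin & Gray, 2015).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile of a 3x3 convolution over a 4x4 input tile,
    using 16 multiplies in the transformed domain instead of 36."""
    U = G @ g @ G.T             # transform the 3x3 filter to 4x4
    V = BT @ d @ BT.T           # transform the 4x4 input tile
    return AT @ (U * V) @ AT.T  # element-wise product, then inverse transform

# Sanity check against direct convolution (cross-correlation) on one tile.
d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```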
Table 2: Avg. inference time (ms) on ARM-based CPUs.
Devices/Cores | 1 | 2 | 4 | 8 | GPU |
---|---|---|---|---|---|
Galaxy S8 | 925 | 630 | 489 | | |
iPhone 7 | 374 | 284 | | | |
Huawei D05 | 755 | 399 | 226 | 149 | |
Phytium FT1500A | 1769 | 1020 | 639 | 444 | |
RK3399 | 1673 | 1420 | | | |
TensorFlow Lite on iPhone 7 | 905 | | | | |
ACL on RK3399 | 4103 | 1645 | | | |
TVM on RK3399 | - | 1293 | | | |
Intel Movidius* | 812 | | | | |
- Intel Movidius operates at FP16 precision.
We conduct a layer-by-layer VGG-16 benchmark on the Samsung S8 and iPhone 7, testing two FeatherCNN compute patterns, IM2COL + GEMM and Winograd F(2x2,3x3)/F(6x6,3x3), against NNPACK's Winograd F(6x6,3x3). Note that NNPACK is invoked from a small standalone benchmark program in order to eliminate framework overhead; it performs better this way than in the Caffe/Caffe2 integrations.
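For background, the IM2COL + GEMM pattern unfolds every 3x3 input window into a column of an auxiliary matrix, so the whole convolution layer collapses into one large matrix multiply. Below is a minimal NumPy sketch of the idea for the unit-stride, unit-padding case used by VGG-16; it is illustrative only, not FeatherCNN's implementation:

```python
import numpy as np

def conv3x3_im2col(x, w):
    """3x3 convolution with unit stride and unit padding via IM2COL + GEMM.
    x: input of shape (C, H, W); w: filters of shape (K, C, 3, 3)."""
    c, h, n = x.shape
    k = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad H and W by 1
    # Unfold: each row holds one (channel, filter-tap) plane, flattened.
    cols = np.empty((c * 9, h * n), dtype=x.dtype)
    for ci in range(c):
        for i in range(3):
            for j in range(3):
                cols[ci * 9 + i * 3 + j] = xp[ci, i:i + h, j:j + n].ravel()
    # One GEMM does all the multiply-accumulates: (K, C*9) x (C*9, H*W).
    return (w.reshape(k, -1) @ cols).reshape(k, h, n)
```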
All the convolution layers use 3x3 filters with unit stride and unit padding. Detailed layer configurations are as follows, where C is input channels, K is output channels, H is image height and W is image width:
VGG-16 conv layer configurations
Layer Name | C | K | H | W |
---|---|---|---|---|
conv1.1 | 3 | 64 | 224 | 224 |
conv1.2 | 64 | 64 | 224 | 224 |
conv2.1 | 64 | 128 | 112 | 112 |
conv2.2 | 128 | 128 | 112 | 112 |
conv3.1 | 128 | 256 | 56 | 56 |
conv3.2-3.3 | 256 | 256 | 56 | 56 |
conv4.1 | 256 | 512 | 28 | 28 |
conv4.2-4.3 | 512 | 512 | 28 | 28 |
conv5.1-5.3 | 512 | 512 | 14 | 14 |
Performance is measured in GFLOPS, computed as (2.0 * K * C * H * W * 3 * 3) / Time / 1.0e9. Note that the Winograd algorithm performs fewer arithmetic operations than the standard algorithm, so the measured value is an "effective" performance that may exceed the device's nominal compute capability.
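As a concrete example, the sketch below evaluates this formula for one layer; the shape is conv2.2 from the table above, while the 25 ms timing is purely hypothetical:

```python
def effective_gflops(c, k, h, w, time_ms):
    """'Effective' GFLOPS of a 3x3 conv layer: FLOPs are always counted
    as if the direct algorithm were used (one multiply + one add per MAC),
    regardless of how the layer was actually computed."""
    flops = 2.0 * k * c * h * w * 3 * 3
    return flops / 1.0e9 / (time_ms / 1.0e3)

# conv2.2 of VGG-16 (C=128, K=128, H=W=112) at a hypothetical 25 ms:
print(effective_gflops(128, 128, 112, 112, 25.0))  # ~148 GFLOPS
```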
(Figure: layer-by-layer performance on iPhone 7)
(Figure: layer-by-layer performance on Samsung Galaxy S8)
We test thread scalability on a dual-socket Huawei D05 server, which is equipped with 64 Cortex-A72 cores integrated in two Hi1616 SoCs. On each SoC, the cores are connected via token rings. Each core has 48KB of L1 instruction cache and 32KB of L1 data cache, every four cores share 1MB of L2 cache, and all cores share 32MB of L3 cache. We are told that the L2 cache bandwidth is insufficient to feed four concurrent cores, so peak performance usually comes at 32 threads. We benchmark FeatherCNN against Caffe + OpenBLAS and Caffe2 + Eigen. The tests were finished by the end of 2017; performance may have changed considerably since then. The FeatherCNN speedup is calculated by dividing each baseline's best inference time, achieved at any thread count, by FeatherCNN's best inference time (illustrated in the sketch after the Caffe2 + Eigen table below).
Average inference time (ms) - FeatherCNN-F(2x2,3x3)
Network / #Cores | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
---|---|---|---|---|---|---|---|
VGG-16 | 1333 | 697 | 385 | 218 | 157 | 117 | 102 |
GoogleNet | 333 | 210 | 154 | 125 | 126 | 151 | 230 |
ResNet-50 | 573 | 356 | 187 | 117 | 104 | 65 | 194 |
SqueezeNet | 149 | 79 | 44 | 28 | 29 | 35 | 67 |
MobileNet | 124 | 70 | 42 | 36 | 34 | 52 | 76 |
DenseNet-121 | 517 | 273 | 156 | 98 | 113 | 160 | 331 |
Average inference time (ms) - Caffe + OpenBLAS
Network / #Cores | 1 | 2 | 4 | 8 | 16 | 32 | 64 | FeatherCNN Speedup |
---|---|---|---|---|---|---|---|---|
VGG-16 | 3329 | 2227 | 1443 | 1108 | 1137 | 2109 | 3721 | 10.86 |
GoogleNet | 1028 | 929 | 861 | 831 | 822 | 848 | 857 | 13.7 |
ResNet-50 | 728 | 490 | 347 | 278 | 252 | 346 | 365 | 3.88 |
SqueezeNet | 190 | 127 | 92 | 76 | 74 | 84 | 92 | 1.68 |
MobileNet | 211 | 166 | 146 | 139 | 137 | 153 | 184 | 4.03 |
DenseNet-121 | 865 | 593 | 438 | 373 | 354 | 655 | 856 | 3.08 |
Average inference time (ms) - Caffe2 + Eigen
Network / #Cores | 1 | 2 | 4 | 8 | 16 | 32 | 64 | FeatherCNN Speedup |
---|---|---|---|---|---|---|---|---|
VGG-16 | 3267 | 2173 | 1550 | 1310 | 1385 | 1323 | 1401 | 12.84 |
GoogleNet | 351 | 347 | 267 | 306 | 894 | 2422 | 3938 | 4.45 |
ResNet-50 | 869 | 549 | 374 | 262 | 149 | 355 | 724 | 2.29 |
SqueezeNet | 91 | 65 | 55 | 87 | 221 | 628 | 723 | 1.25 |
MobileNet | 174 | 139 | 110 | 90 | 110 | 171 | 592 | 2.65 |
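To make the speedup definition above concrete, the following sketch recomputes the VGG-16 entry of the Caffe + OpenBLAS table from the numbers already shown:

```python
# Avg. VGG-16 inference times (ms) on Huawei D05 at 1..64 threads,
# copied from the tables above.
feather = [1333, 697, 385, 218, 157, 117, 102]
caffe_openblas = [3329, 2227, 1443, 1108, 1137, 2109, 3721]

# Speedup = baseline's best time at any thread count / FeatherCNN's best.
speedup = min(caffe_openblas) / min(feather)
print(f"{speedup:.2f}x")  # 1108 / 102 -> 10.86x
```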
Embedded systems usually adopt energy-efficient core designs and low-power memory systems, which implies limited floating-point compute capability as well as low memory bandwidth. We have run benchmarks on three embedded development boards: the Raspberry Pi 3B, Firefly RK3399 and Nvidia Tegra X2, whose specifications are listed in the following table.
Device Name | Core Specs | Mem Cap. |
---|---|---|
Rasp Pi 3B | 4 Cortex-A53 @ 1.2GHz | 1GB |
RK3399 | 2 Cortex-A72 @ 1.8GHz + 4 Cortex-A53 @ 1.4GHz | 4GB |
Nvidia TX2 | 2 Denver2 + 4 Cortex-A57 | 8GB |
As some chips adopt the big.LITTLE architecture for energy efficiency, we first run benchmarks on the "big" cores, then on the "LITTLE" cores, and finally enable all cores. Performance usually peaks with all "big" cores; the "LITTLE" cores do not help much, as scheduling overhead diminishes the benefit of the extra compute capability. A sketch of how a benchmark can be pinned to one cluster follows.
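One way to implement this measurement protocol on Linux is to pin the benchmark process to the chosen cluster before it spawns worker threads. The sketch below assumes the RK3399's usual core numbering and a hypothetical benchmark binary:

```python
import os
import subprocess

# On the RK3399 under mainline Linux, CPUs 0-3 are typically the "LITTLE"
# Cortex-A53 cluster and CPUs 4-5 the "big" Cortex-A72 cluster
# (assumption; verify with /proc/cpuinfo on your board).
BIG_CORES = {4, 5}

# Pin this process (and the threads/children it spawns) to the big
# cluster so the scheduler cannot migrate work onto LITTLE cores.
os.sched_setaffinity(0, BIG_CORES)
subprocess.run(["./feather_benchmark", "vgg16.feathermodel"])  # hypothetical binary
```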
The benchmark results are as follows; all data are inference times measured in milliseconds:
Raspberry Pi 3B - FeatherCNN
Network / #Cores | 1 | 2 | 4 |
---|---|---|---|
VGG-16 | - | - | - |
GoogleNet | 1058 | 642 | 809 |
ResNet-50 | 2107 | 1255 | 1540 |
SqueezeNet | 638 | 399 | 501 |
MobileNet | 451 | 275 | 206 |
DenseNet-121 | 630 | 396 | 459 |
Raspberry Pi 3B - Caffe + OpenBLAS
Waiting for test
Raspberry Pi 3B - Caffe2 + NNPACK
Waiting for test
RK3399 - FeatherCNN
Network / #Cores | 1 (A72) | 2 (A72) | 1 (A53) | 2 (A53) | 4 (A53) | all | Memory (MB) |
---|---|---|---|---|---|---|---|
VGG-16 | 2268 | 1620 | 6122 | 3422 | 2269 | 1932 | 904 |
GoogleNet | 416 | 250 | 927 | 524 | 333 | 294 | 168 |
ResNet-50 | 857 | 517 | 1834 | 1009 | 671 | 555 | 466 |
SqueezeNet | 236 | 144 | 539 | 315 | 210 | 172 | 404 |
MobileNet | 242 | 137 | 487 | 271 | 165 | 153 | 176 |
DenseNet-121 | 842 | 543 | 1854 | 1050 | 686 | 543 | 111 |
Nvidia TX2 - FeatherCNN
Network / #Cores | 1 (Denver2) | 2 (Denver2) | 1 (A57) | 2 (A57) | 4 (A57) | all |
---|---|---|---|---|---|---|
VGG-16 | 1325 | 706 | 2540 | 1507 | 1226 | 844 |
GoogleNet | 274 | 146 | 366 | 206 | 127 | 105 |
ResNet-50 | 480 | 266 | 759 | 417 | 261 | 215 |
SqueezeNet | 88 | 115 | 73 | 61 | 204 | 153 |
MobileNet | 156 | 87 | 211 | 116 | 68 | 56 |
DenseNet-121 | - | - | - | - | - | - |
Nvidia TX2 - Caffe2 + NNPACK
Network / #Cores | 1 (Denver2) | 2 (Denver2) | 1 (A57) | 2 (A57) | 4 (A57) | all |
---|---|---|---|---|---|---|
VGG-16 | 5876 | 3569 | 5900 | 3587 | 6253 | 5602 |
GoogleNet | 816 | 560 | 757 | 524 | 4363 | 4313 |
ResNet-50 | 1396 | 753 | 1394 | 757 | 5841 | 6108 |
SqueezeNet | 170 | 116 | 172 | 124 | 899 | 782 |
MobileNet-V2 | 57671 | - | - | - | - | - |
DenseNet-121 | 1223 | 720 | 1189 | 727 | 10970 | 10755 |