
Turing RK1 - 32GB #38

Open

geerlingguy opened this issue Feb 22, 2024 · 18 comments

Comments

@geerlingguy (Owner) commented Feb 22, 2024

[Photo: DSC05356]

Basic information

  • Board URL (official): https://turingpi.com/product/turing-rk1/
  • Board purchased from: (Review sample, but I also pre-ordered four 8GB modules)
  • Board purchase date: 2023-08-15 (pre-order placed)
  • Board specs (as tested): 32 GB RAM / 32 GB eMMC
  • Board price (as tested): $299 (+ $12 heatsink)

Linux/system information

# output of `neofetch`
            .-/+oossssoo+/-.               ubuntu@ubuntu 
        `:+ssssssssssssssssss+:`           ------------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.3 LTS aarch64 
    .ossssssssssssssssssdMMMNysssso.       Host: Turing Machines RK1 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.10.160-rockchip 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 10 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 702 (dpkg) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Terminal: /dev/pts/0 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: (8) @ 1.800GHz 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Memory: 249MiB / 31787MiB 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+
.ssssssssdMMMNhsssssssssshNMMMdssssssss.                           
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/                            
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.

# output of `uname -a`
Linux ubuntu 5.10.160-rockchip #30 SMP Mon Jan 29 02:38:59 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Power

Methodology: Since I don't have a Jetson carrier board with a single SoM slot I can use for the RK1, I am going to boot the Turing Pi 2 board with no modules installed, wait 5 minutes, and get a power reading. Then I will insert a node in slot 1 and get a reading, then add another in slot 2 and get a reading. That should show how power consumption changes in each scenario, and I can subtract the Turing Pi 2 board's baseline power consumption to estimate per-node draw. It's not a perfect methodology, but what are you gonna do in this case?
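In other words, each node's figure is just the delta between successive wall readings; a minimal sketch of the arithmetic, using the idle readings reported further down in this thread:

```bash
# Baseline (BMC only, no nodes): 2.6 W; one node booted and idle: 7.8 W
awk 'BEGIN { printf "per-node idle: %.1f W\n", 7.8 - 2.6 }'   # => per-node idle: 5.2 W
```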

  • Idle power draw (at wall): 2.85W*
  • Maximum simulated power draw (stress-ng --matrix 0): 15 W
  • During Geekbench multicore benchmark: 14.4 W
  • During top500 HPL benchmark: 18.1 W
  • *Measured at the SoM level, not accounting for power supply losses or carrier board power consumption. Including those, at least with the Turing Pi 2, idle power consumption would be around 5 W.

The other power metrics were measured at the board level, so they include power supply losses and BMC power draw, which total around 2 W.

Disk

32GB eMMC built-in

| Benchmark | Result |
|-----------|--------|
| fio 1M sequential read | 251.00 MB/s |
| iozone 1M random read | 259.00 MB/s |
| iozone 1M random write | 105.29 MB/s |
| iozone 4K random read | 22.93 MB/s |
| iozone 4K random write | 39.00 MB/s |

1TB Teamgroup NVMe SSD (TM8FP4001T)

| Benchmark | Result |
|-----------|--------|
| fio 1M sequential read | 3093.00 MB/s |
| iozone 1M random read | 1417.15 MB/s |
| iozone 1M random write | 1577.96 MB/s |
| iozone 4K random read | 32.30 MB/s |
| iozone 4K random write | 71.68 MB/s |

curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash

Run benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with curl -o disk-benchmark.sh [URL_HERE] and run sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh (assuming the device is sda).
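For example, for the NVMe drive tested above (assuming it enumerates as /dev/nvme0n1 and is mounted at /mnt/nvme; adjust both for your system):

```bash
# Fetch the benchmark script from the pi-cluster repo and run it against the NVMe drive:
curl -o disk-benchmark.sh https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo DEVICE_UNDER_TEST=/dev/nvme0n1 DEVICE_MOUNT_PATH=/mnt/nvme ./disk-benchmark.sh
```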

Also consider running the PiBenchmarks.com script.

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: 942 Mbps
  • iperf3 --reverse -c $SERVER_IP: 927 Mbps
  • iperf3 --bidir -c $SERVER_IP: 937 Mbps up, 230 Mbps down

(Be sure to test all interfaces, noting any that are non-functional.)
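For reference, the invocations behind those numbers look like this (10.0.100.1 is a placeholder for whatever machine runs the iperf3 server):

```bash
# On the server machine:
iperf3 -s

# On the RK1: upload, download (reverse), then bidirectional:
iperf3 -c 10.0.100.1
iperf3 --reverse -c 10.0.100.1
iperf3 --bidir -c 10.0.100.1
```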

GPU

  • TODO: Haven't determined a standardized benchmark yet. See Issue #2.

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  12010.5 MB/s (0.2%)
 C copy backwards (32 byte blocks)                    :  11994.0 MB/s
 C copy backwards (64 byte blocks)                    :  11989.9 MB/s
 C copy                                               :  12374.0 MB/s
 C copy prefetched (32 bytes step)                    :  12485.4 MB/s
 C copy prefetched (64 bytes step)                    :  12517.2 MB/s
 C 2-pass copy                                        :   5322.3 MB/s (0.2%)
 C 2-pass copy prefetched (32 bytes step)             :   8041.3 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   8495.3 MB/s
 C fill                                               :  30175.2 MB/s (0.1%)
 C fill (shuffle within 16 byte blocks)               :  30166.6 MB/s
 C fill (shuffle within 32 byte blocks)               :  30169.2 MB/s
 C fill (shuffle within 64 byte blocks)               :  30167.5 MB/s
 NEON 64x2 COPY                                       :  12599.5 MB/s
 NEON 64x2x4 COPY                                     :  12548.0 MB/s
 NEON 64x1x4_x2 COPY                                  :   5758.3 MB/s (0.5%)
 NEON 64x2 COPY prefetch x2                           :  11318.2 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  11703.9 MB/s
 NEON 64x2 COPY prefetch x1                           :  11448.4 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  11705.6 MB/s
 ---
 standard memcpy                                      :  12579.7 MB/s
 standard memset                                      :  30194.0 MB/s (0.1%)
 ---
 NEON LDP/STP copy                                    :  12594.4 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  12427.5 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  12454.2 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  12518.7 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  12514.2 MB/s
 NEON LD1/ST1 copy                                    :  12545.6 MB/s
 NEON STP fill                                        :  30157.8 MB/s
 NEON STNP fill                                       :  30190.9 MB/s
 ARM LDP/STP copy                                     :  12583.4 MB/s
 ARM STP fill                                         :  30162.0 MB/s (0.2%)
 ARM STNP fill                                        :  30183.5 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.5 ns 
    262144 :    2.2 ns          /     2.8 ns 
    524288 :    3.3 ns          /     4.0 ns 
   1048576 :    9.9 ns          /    13.0 ns 
   2097152 :   13.3 ns          /    15.6 ns 
   4194304 :   35.6 ns          /    52.4 ns 
   8388608 :   76.7 ns          /   103.7 ns 
  16777216 :   99.1 ns          /   121.0 ns 
  33554432 :  110.7 ns          /   127.4 ns 
  67108864 :  117.6 ns          /   131.1 ns 

sbc-bench results

Run sbc-bench and paste a link to the results here: http://sprunge.us/wornU5
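If you want to reproduce this, one common way to run sbc-bench (assuming the script path in ThomasKaiser/sbc-bench hasn't changed; check that repo's README for current instructions):

```bash
wget https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/master/sbc-bench.sh
sudo /bin/bash ./sbc-bench.sh
```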

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 12.004 sec
  • pts/x264 4K: 5.65 fps
  • pts/x264 1080p: 24.12 fps
  • pts/phpbench: 416192 (score)
  • pts/build-linux-kernel (defconfig): 1145.534 sec
@geerlingguy (Owner Author)

Also benchmarking a full cluster of 4x RK1 nodes over on top500-benchmark: geerlingguy/top500-benchmark#27

@geerlingguy (Owner Author)

On one of the nodes, I was trying to update software, and got:

E: Release file for https://ppa.launchpadcontent.net/jjriek/rockchip/ubuntu/dists/jammy/InRelease is not valid yet (invalid for another 16h 59min 14s). Updates for this repository will not be applied.

So I had to manually force a time re-sync:

sudo timedatectl set-ntp off
sudo timedatectl set-ntp on
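To confirm the clock actually re-synced afterwards:

```bash
timedatectl status   # look for "System clock synchronized: yes"
```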

@geerlingguy (Owner Author)

On my Turing Pi 2 board, I've upgraded the firmware to 2.0.5 (the upgrade looked like it errored out, but after I power-cycled the BMC it had actually updated), and I installed Ubuntu on all four nodes with the latest updates in place.

Nodes 1, 3, and 4 use PWM to spin down the fans and adjust the speed based on CPU/SoC temperature... but node 2 for some reason always spins its fan at full speed.

If I check the fan speeds I see:

ubuntu@turing2:~$ cat /sys/devices/platform/pwm-fan/hwmon/hwmon8/pwm1
100

I've asked on Discord if anyone knows what might be causing the PWM control to not work on a fresh install.
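One quick sanity check (a sketch only; the hwmon index can vary per boot/kernel, and the usual pwm1 range is 0-255) is to write a value directly and see whether the fan responds:

```bash
# Try a mid-range duty cycle; if the fan still runs flat out after this,
# the PWM line itself may not be working on that node:
echo 128 | sudo tee /sys/devices/platform/pwm-fan/hwmon/hwmon8/pwm1
```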

@geerlingguy (Owner Author)

I have the power monitor hooked up now. It looks like in idle state, with no RK1s booted, the board draws 2.6W to power the BMC. Here are some metrics in the default state—with NVMe SSDs on each node:

  • BMC only (no nodes powered on): 2.6W
  • BMC + 1 node booted and idle: 7.8W (+5.2W)
  • BMC + 2 nodes booted and idle: 12.6W (+4.8W)
  • BMC + 3 nodes booted and idle: 18.4W (+5.8W)
  • BMC + 4 nodes booted and idle: 24.2W (+5.8W)

After shutting down Ubuntu on each node, but leaving the slots powered on, the board consumes 3.7W. After I power off each slot in the Turing Pi BMC, the BMC alone draws 2.6W again. So the slot power rails (with the SoMs shut down) consume roughly a quarter watt each.
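That per-slot figure is the same subtraction as before, spread across the four slots:

```bash
# Slots powered but SoMs shut down: 3.7 W total; BMC alone: 2.6 W
awk 'BEGIN { printf "per-slot: %.2f W\n", (3.7 - 2.6) / 4 }'   # => per-slot: 0.28 W
```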

[Screenshot: power monitor reading, 2024-02-23 11:45 AM]

Next I'll test with no NVMe SSDs plugged in.

@geerlingguy (Owner Author)

Without NVMe SSDs:

  • BMC only (no nodes powered on): 2.6W
  • BMC + 1 node booted and idle: 5.1W (+2.5W)
  • BMC + 2 nodes booted and idle: 7.5W (+2.4W)
  • BMC + 3 nodes booted and idle: 10.5W (+3W)
  • BMC + 4 nodes booted and idle: 14W (+3.5W)

[Screenshot: power monitor reading, 2024-02-23 12:02 PM]

Next I'll compare the 'performance' and 'ondemand' cpufreq governors.

@geerlingguy (Owner Author) commented Feb 23, 2024

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
conservative ondemand userspace powersave performance schedutil

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
performance
performance
performance
performance

$ echo ondemand | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
ondemand

Power went from 5.2W to 5.1W. So... not really worth messing with the default IMO for this particular SoM, unless maybe you want to switch to powersave or something for a weird use case.

Confirmed that power consistently jumped to 5.2W with powersave, and down to 5.1W with ondemand.
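To verify a governor is actually downclocking at idle, you can watch the per-core frequencies (standard cpufreq sysfs; values are in kHz):

```bash
# With ondemand, idle cores should report well below their maximum frequency:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```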

@geerlingguy (Owner Author)

Single node HPL performance is 59.810 Gflops at 3.30 Gflops/W efficiency: geerlingguy/top500-benchmark#27 (comment)

@geerlingguy (Owner Author)

Full cluster of 4x 32GB nodes, with Ps/Qs being 4/8: 224.60 Gflops, 73 W, for 3.08 Gflops/W
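The efficiency figure is just throughput divided by measured power:

```bash
awk 'BEGIN { printf "%.2f Gflops/W\n", 224.60 / 73 }'   # => 3.08 Gflops/W
```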

@ThomasKaiser

> power monitor

It's worth noting that power consumption depends on settings/software.

For example, when running RK3588 devices with mainline (I tried kernel 6.8), my Rock 5B shows idle consumption over twice as high, and the throttling behaviour is nuts too (I'm updating my Rock 5B review soon and will add a link to actual measurements). As such, it would be great if you could add a note about the kernel version you were running (5.10 BSP, 6.1 BSP, or 6.x mainline) to the consumption numbers :)

@geerlingguy (Owner Author) commented Feb 27, 2024

@ThomasKaiser - That information is always up in the Linux/system information section :)

I always try to run all tests on the vendor-recommended OS version, and if one doesn't exist (which is thankfully rarer these days), Armbian.

Also, for completeness: the power monitor I'm using is a Thirdreality Zigbee 'smart' plug; I've been setting up a few of these around my test benches for convenience.

@alm7640 commented Mar 5, 2024

Did you ever create a playbook for the RK1 boards?

@geerlingguy (Owner Author)

@alm7640 - See: https://github.com/geerlingguy/pi-cluster — there are a few tweaks needed, and I may do a follow-up live stream with those tweaks added :)

@hispanico

Did you ever test the NPU on the RK1 module?

@geerlingguy (Owner Author)

No, I might test it with Frigate to see if it can perform as well as a Coral on one of my Pi 4/5's.

@geerlingguy (Owner Author)

I've also set up Drupal in a Kubernetes cluster on the RK1 boards; comparing real-world performance to the CM4, some things are a little faster, and some things are a lot faster:

| Benchmark | DeskPi Super 6C + CM4 | Turing Pi 2 + RK1 |
|-----------|-----------------------|-------------------|
| wrk (anonymous) | 78.65 req/s | 104.30 req/s |
| ab (authenticated) | 72.42 req/s | 165.55 req/s |

Full details here: geerlingguy/pi-cluster#10 (comment)
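The commands behind those numbers are along these lines (the URL, duration, concurrency, and cookie are illustrative placeholders, not the exact parameters from the linked issue):

```bash
# Anonymous traffic (wrk reports requests/sec):
wrk -t4 -c16 -d30s http://drupal.example.local/

# Authenticated traffic (ab with a session cookie):
ab -n 100 -c 4 -C "SESSxyz=abc123" http://drupal.example.local/
```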

@alm7640 commented Apr 24, 2024

Great news on the RK1 Ubuntu fixes for the playbook. I plan to test it out ASAP.

@geerlingguy (Owner Author)

Video posted today: Meet the new SBC Linux Cluster King!

@geerlingguy (Owner Author)

The RK1 gets a mention in today's video on the LattePanda Mu.
