HPL Installation Guide

HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers

The Linpack Benchmark is a software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers. It is also the metric used to rank system performance in the TOP500 list of the world's 500 fastest supercomputers.
This page describes how to install HPL and tune its parameters.

Prepare a cluster environment

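Build a cluster from physical machines, or a virtual cluster from virtual machines, so that it can run MPI programs. Here we use virsh/virt-manager (libvirt) to build a virtual cluster.
The first virtual machine uses the br0 IP + 40.
The second virtual machine uses the br0 IP + 80 (we create the second virtual machine after HPL has been installed).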

Install HPL

Because HPL benchmarks distributed systems, it requires the Message Passing Interface (MPI) and either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL). Here we use Open MPI and ATLAS (Automatically Tuned Linear Algebra Software).

$ ssh -X cloud@VM1_IP
$ sudo aptitude update
$ sudo aptitude install libopenmpi-dev openmpi-bin build-essential libatlas3-base
$ wget http://www.netlib.org/benchmark/hpl/hpl-2.0.tar.gz
$ tar zxvf hpl-2.0.tar.gz
$ mv hpl-2.0 HPL
$ cd HPL
$ cp setup/Make.Linux_ATHLON_CBLAS Make.LinuxKVM
$ nano Make.LinuxKVM
$ diff Make.LinuxKVM setup/Make.Linux_ATHLON_CBLAS
64c64
< ARCH         = LinuxKVM
---
> ARCH         = Linux_ATHLON_CBLAS
70c70
< TOPdir       = $(HOME)/HPL
---
> TOPdir       = $(HOME)/hpl
84c84
< MPdir        = /usr/lib/openmpi
---
> MPdir        = /usr/local/mpi
86c86
< MPlib        = $(MPdir)/lib/libmpi.so
---
> MPlib        = $(MPdir)/lib/libmpich.a
95c95
< LAdir        = /usr/lib/atlas-base/
---
> LAdir        = $(HOME)/netlib/ARCHIVES/Linux_ATHLON
97c97
< LAlib        = $(LAdir)/libcblas.so.3 $(LAdir)/libatlas.so.3
---
> LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
$ make arch=LinuxKVM
$ echo $?
0
$ make install arch=LinuxKVM
$ echo $?
0

Prepare the cluster environment (continued)

Shut down the virtual machine.
Clone the virtual machine.
On the second virtual machine, edit /etc/hostname, /etc/hosts, and /etc/network/interfaces.
Start both virtual machines.

Multi-node setup

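Multi-node runs are launched over ssh, so the nodes must authenticate to each other with ssh keys. Neither side may be prompted for a password, or the run will not proceed.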

ssh remote access authentication (passwordless)

$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cloud/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/cloud/.ssh/id_rsa.
Your public key has been saved in /home/cloud/.ssh/id_rsa.pub.
The key fingerprint is:
0e:1d:9b:f2:74:bc:da:14:be:88:10:13:fd:c9:4a:5a cloud@demo
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|     .           |
|    . . .        |
|     . + *       |
|    o E S +      |
|     * B o o     |
|    o . o +      |
|     . . = .     |
|      . o o      |
+-----------------+

Copy the public key to the other nodes

$ ssh-copy-id cloud@node1
$ ssh-copy-id cloud@node2

Repeat the same steps on every node.

Create the node list

$ cd ~/HPL/bin/LinuxKVM
$ nano node.list
VM1_IP	slots=4
VM2_IP	slots=4

Edit N, P x Q, and the other parameters in HPL.dat, then copy the file to every node.
```
$ scp HPL.dat VM2_IP:`pwd`
```

Run HPL

$ mpiexec -np 8 -hostfile node.list ./xhpl
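
The `-np 8` in the command above matches the total number of slots declared in node.list (4 + 4). As a minimal sketch, the total can be computed from the hostfile text. The helper name `total_slots` is hypothetical, and counting a host without `slots=` as one slot is an assumption of this sketch rather than documented Open MPI behavior.

```python
def total_slots(hostfile_text):
    """Sum the slots= values of a hostfile given as text."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        slots = 1  # assumption of this sketch: a host without slots= counts once
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                slots = int(value)
        total += slots
    return total


node_list = "VM1_IP\tslots=4\nVM2_IP\tslots=4\n"
print(total_slots(node_list))  # 8
```

The result is the largest value of `-np` that fits the declared slots.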

HPL Tuning

$ cd ~/HPL/bin/LinuxKVM/
$ more HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
$ mpiexec -np 4 ./xhpl | tee "HPL-result.txt"
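
Before running, it helps to know how many tests one HPL.dat describes. The sketch below multiplies the "# of ..." counts in the sample file shown above, assuming xhpl runs one test for every combination of the listed values. The names `SAMPLE_HPL_DAT`, `COUNT_LINES`, and `count_runs` are hypothetical.

```python
from math import prod

# First 25 lines of the sample HPL.dat shown above.
SAMPLE_HPL_DAT = """\
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
"""

# 1-based HPL.dat line numbers that hold a "# of ..." count.
COUNT_LINES = (5, 7, 10, 14, 16, 18, 20, 22, 24)


def count_runs(hpl_dat_text):
    """Multiply the counts to get the number of parameter combinations."""
    lines = hpl_dat_text.splitlines()
    return prod(int(lines[n - 1].split()[0]) for n in COUNT_LINES)


print(count_runs(SAMPLE_HPL_DAT))  # 864
```

With large values of N, each combination takes far longer, so it is usually worth trimming the lists to a few candidates first.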

Parameter reference

Line 1: (unused) Typically used to describe the parameters in this file.
```
HPLinpack benchmark input file
```

Line 2: (unused) Same as above.

Innovative Computing Laboratory, University of Tennessee

Line 3: The output file name, used when line 4 is neither 6 nor 7.
```
HPL.out      output file name (if any)
```
Line 4: The output device. 6 is stdout, 7 is stderr; any other integer writes to the file named on line 3.
```
6            device out (6=stdout,7=stderr,file)
```
Line 5: The number of matrix sizes (line 6) to run; it must be less than or equal to 20.
```
4            # of problems sizes (N)
```
Line 6: The matrix sizes.
```
29 30 34 35  Ns
```
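As a rule of thumb, choose $N$ so that the matrix occupies about 80% of total memory. Each matrix element is a double (8 bytes), so:
$N \times N \times 8 = \text{Memory Size (bytes)} \times 80\%$
$N = \sqrt{\frac{\text{Memory Size (bytes)} \times 80\%}{8}}$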
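
The rule of thumb can be worked through numerically. This is a minimal sketch: the helper name `problem_size` and the 8 GiB total (for example two nodes with 4 GiB each) are hypothetical. Rounding N down to a multiple of NB is a common practice added here as an assumption; it is not stated in this guide.

```python
import math


def problem_size(total_memory_bytes, fraction=0.8, nb=1):
    """Largest N with N * N * 8 <= total_memory_bytes * fraction.

    The result is rounded down to a multiple of nb (nb=1 disables rounding).
    """
    n = int(math.sqrt(total_memory_bytes * fraction / 8))
    return n - n % nb


total = 8 * 1024**3  # hypothetical cluster total: 8 GiB
print(problem_size(total))          # N from the 80% rule
print(problem_size(total, nb=128))  # same, rounded down to a multiple of NB = 128
```

Memory used by the operating system and other processes comes out of the remaining 20%, so a smaller fraction may be needed on machines with little RAM.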
Line 7: The number of block sizes (line 8) to run; it must be less than or equal to 20.
```
4            # of NBs
```
Line 8: The block sizes used to partition the matrix.
```
1 2 3 4      NBs
```
The best NB is found mainly by experiment and is usually at most 256. $NB \times 8$ should be a multiple of the cache line size. For example, with a 1024K L2 cache, NB can be set to 128.
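
The cache-line condition can be used to narrow the NB values worth testing. The sketch below lists every NB up to 256 for which NB * 8 bytes is a multiple of the cache line size. The helper name `nb_candidates` and the 64-byte cache line are assumptions; check the actual value for your CPU.

```python
def nb_candidates(cache_line_bytes=64, max_nb=256):
    """Return the NB values whose row of doubles (NB * 8 bytes) fills whole cache lines."""
    return [nb for nb in range(1, max_nb + 1)
            if (nb * 8) % cache_line_bytes == 0]


candidates = nb_candidates()
print(candidates[:4], "...", candidates[-1])
print(128 in candidates)
```

This only filters the candidates; the best NB still has to be found by running xhpl with each of them.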
Line 9: Row-major mapping suits systems with many nodes and few CPUs per node; column-major mapping suits systems with few nodes and many CPUs per node (e.g., supercomputers).
```
0            PMAP process mapping (0=Row-,1=Column-major)
```
Lines 10~12: The process grids (P x Q) to run.
```
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
```
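$P \times Q = \text{total number of cores} = \text{number of processes}$
In general, keep P small to avoid frequent exchanges of information with other nodes: $P \leq Q$.
$P = 2^n$ is preferred, because the L factorization uses binary exchange.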
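
The constraints above can be enumerated for a given process count. This minimal sketch lists every grid with $P \times Q$ equal to the number of processes and $P \leq Q$, and flags whether P is a power of two. The helper name `process_grids` is hypothetical.

```python
import math


def process_grids(num_processes):
    """Return (P, Q, p_is_power_of_two) for every grid with P * Q == num_processes and P <= Q."""
    grids = []
    for p in range(1, math.isqrt(num_processes) + 1):
        if num_processes % p == 0:
            q = num_processes // p
            is_power_of_two = (p & (p - 1)) == 0  # true for 1, 2, 4, 8, ...
            grids.append((p, q, is_power_of_two))
    return grids


for p, q, ok in process_grids(8):
    print(p, "x", q, "P is a power of two" if ok else "P is not a power of two")
```

The surviving grids go on the Ps and Qs lines of HPL.dat, with the count on the line above them.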
Line 13: The threshold against which the accuracy of the result is checked; it usually does not need to be changed.
```
16.0         threshold
```

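Lines 14~21: How the L factorization is performed. In our tests, an NDIVs value of 2 works best, and NBMINs values of 4 or 8 both work well. RFACTs and PFACTs have little effect on performance.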

3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)

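Lines 22, 23: HPL provides 6 broadcast algorithms. The first 4 suit fast networks; the last two split the data before sending it and suit slower networks. The usual recommendation is 2 (2rg).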
```
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
```
Lines 24, 25: The lookahead depth used in message passing. The best value depends on the machine configuration.
```
1            # of lookahead depth
0            DEPTHs (>=0)
```
Lines 26, 27: How U is broadcast. U is broadcast in the column direction. Three algorithms are available: binary exchange, long, and a mix of the two.
```
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
```

Lines 28, 29: The storage format of L and U. Transposed form stores the data by column; otherwise it is stored by row.

0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form

Line 30: Usually does not need to be changed. It only takes effect when line 26 is set to 1 or 2.
```
1            Equilibration (0=no,1=yes)
```

Line 31: The memory alignment.

8            memory alignment in double (> 0)

ChiSheng Su