Building a High-Performance Computing (HPC) Cluster on CentOS 7

About this document

This document describes how to build a high-performance computing (HPC) cluster on CentOS 7.

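Prerequisites

  1. At least two servers or PCs. VMware virtual machines work as a substitute, and this document uses them for convenience.

  2. A CentOS 7 installation image

  3. The torque + maui job scheduling software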

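Procedure

1. Install the operating system

 Install CentOS 7 on both bare-metal servers. A minimal installation is sufficient and is the fastest option.

 After installation, configure networking.

 Add name resolution for both nodes to /etc/hosts on each server. You can edit /etc/hosts on node1 first and then copy it to node2. For example: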

  [root@localhost ~]# cat /etc/hosts  
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.22.123.66 node1
172.22.123.68 node2
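
If you script this step, the entries can be appended idempotently. The following is a minimal sketch, not part of the original procedure; the function name add_host_entry and the HOSTS_FILE variable are illustrative assumptions.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME: append "IP NAME" to FILE unless NAME already has an entry.
# (add_host_entry is a hypothetical helper, not a system command.)
add_host_entry() {
    file=$1 ip=$2 name=$3
    # Look for NAME as a whole whitespace-delimited field on a non-comment line.
    if [ -f "$file" ] && awk -v n="$name" '
        !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

HOSTS_FILE=${HOSTS_FILE:-/etc/hosts}
# Run as root on each node:
# add_host_entry "$HOSTS_FILE" 172.22.123.66 node1
# add_host_entry "$HOSTS_FILE" 172.22.123.68 node2
```

Running the helper twice leaves a single line per hostname, so it is safe to reuse when nodes are added later.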

Disable the firewall and SELinux.
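
One way to do this on CentOS 7 is sketched below, assuming the default firewalld service and the standard /etc/selinux/config file. The function names are illustrative. Note that setenforce 0 only switches the running system to permissive mode; the config file change takes full effect after a reboot.

```shell
#!/bin/sh
# set_selinux_disabled FILE: rewrite the SELINUX= line in an SELinux config file.
# The pattern requires "=" right after SELINUX, so SELINUXTYPE= is left alone.
set_selinux_disabled() {
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

# Stop the firewall now and at boot, relax SELinux now, disable it for the next boot.
disable_firewall_and_selinux() {
    systemctl stop firewalld
    systemctl disable firewalld
    setenforce 0
    set_selinux_disabled /etc/selinux/config
}

# Run as root on every node:
# disable_firewall_and_selinux
```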

 

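2. Configure the parallel computing environment

Configure NIS (Network Information Service)

NIS synchronizes account information across the nodes. node1 is the NIS server; node2 is the NIS client and pulls its account information from node1. Detailed NIS steps: https://gist.github.com/wangxianhe/d26b1c8a08ea7f324543728ca3d28c24

Configure NFS (Network File System)

NFS shares files across the nodes. node1 is the NFS server; node2 is the NFS client and mounts the directory exported by node1. Detailed NFS steps: https://gist.github.com/wangxianhe/d42c0b777287f215d5c18757fc0e0308

Passwordless SSH between the nodes

This is straightforward; see: https://gist.github.com/wangxianhe/d9bb9a4006bc0ec456c0ddb62d69a1a8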
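
Once the keys are in place, a quick check such as the following can confirm that each node accepts key-based login without prompting. This is a sketch only; check_ssh and SSH_BIN are illustrative names, and the node list is whatever you pass in.

```shell
#!/bin/sh
# SSH_BIN can be overridden, for example to point at a wrapper.
SSH_BIN=${SSH_BIN:-ssh}

# check_ssh NODE...: report whether each node accepts login without a password prompt.
# Returns non-zero if any node fails.
check_ssh() {
    failed=0
    for node in "$@"; do
        # BatchMode=yes makes ssh fail instead of asking for a password.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
            </dev/null >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: passwordless login not working"
            failed=1
        fi
    done
    return "$failed"
}

# Example, run from node1:
# check_ssh node1 node2
```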

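3. Install the TORQUE resource manager

The full installation procedure is lengthy, so only a summary is given here. For details, see the official documentation:

TORQUE: http://docs.adaptivecomputing.com/torque/6-1-2/adminGuide/torque.htm

First download the package from the official site:

TORQUE: http://www.adaptivecomputing.com/support/download-center/torque-download/

The version used here is torque-6.1.2.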

3.1 Configure the management node (Torque Server host)

Install the dependencies

[root]# yum install libtool libcgroup-tools openssl-devel libxml2-devel boost-devel gcc gcc-c++

Install hwloc

When cgroups are enabled (recommended), hwloc version 1.9.1 or later is required.

Download hwloc-1.9.1.tar.gz from https://www.open-mpi.org/software/hwloc/v1.9.

yum install gcc make
tar -xzvf hwloc-1.9.1.tar.gz
cd hwloc-1.9.1
./configure
make
make install
echo /usr/local/lib >/etc/ld.so.conf.d/hwloc.conf
ldconfig
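
To confirm that the installed hwloc meets the 1.9.1 minimum before configuring Torque, a version comparison along these lines may help. It is a sketch that assumes GNU sort (for the -V option) and that pkg-config can find the hwloc.pc file under /usr/local/lib/pkgconfig; version_ge is an illustrative helper name.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is greater than or equal to B.
version_ge() {
    # sort -V orders version strings; A >= B exactly when B sorts first (or ties).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: hwloc was installed under /usr/local as in the steps above.
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}
export PKG_CONFIG_PATH

installed=$(pkg-config --modversion hwloc 2>/dev/null || true)
if [ -n "$installed" ] && version_ge "$installed" 1.9.1; then
    echo "hwloc $installed is new enough"
else
    echo "hwloc 1.9.1 or later not detected (found: ${installed:-none})"
fi
```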

Download torque-6.1.2.tar.gz

[root]# yum install wget
[root]# wget http://www.adaptivecomputing.com/download/torque/torque-6.1.2.tar.gz -O torque-6.1.2.tar.gz
[root]# tar -xzvf torque-6.1.2.tar.gz
[root]# cd torque-6.1.2/

Build and install

[root]# ./configure --enable-cgroups --with-hwloc-path=/usr/local # add any other specified options
[root]# make
[root]# make install

Set up the paths

[root]# . /etc/profile.d/torque.sh

Initialize serverdb

[root]# ./torque.setup root

On the Torque Server host, build the packages

[root]# make packages
Building ./torque-package-clients-linux-x86_64.sh ...
Building ./torque-package-mom-linux-x86_64.sh ...
Building ./torque-package-server-linux-x86_64.sh ...
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-devel-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied and executed
on your production machines. Use --help for options.

Copy the MOM package and the client package to the compute nodes. Copying them to the shared area is recommended.

[root]# scp torque-package-mom-linux-x86_64.sh <mom-node>:
[root]# scp torque-package-clients-linux-x86_64.sh <torque-client-host>:
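
With more than one compute node, the copy can be looped. The sketch below assumes it is run from the torque-6.1.2 build directory; NODES, DRY_RUN, run, and push_packages are illustrative names, and the default node list (node2) simply matches the two-node setup in this guide.

```shell
#!/bin/sh
# Space-separated list of compute nodes (assumption: adjust to your cluster).
NODES=${NODES:-node2}

# run CMD...: print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy both self-extracting packages to the home directory on each node.
push_packages() {
    for node in $NODES; do
        run scp torque-package-mom-linux-x86_64.sh \
                torque-package-clients-linux-x86_64.sh "$node:"
    done
}

# Preview the commands first; unset DRY_RUN and call push_packages again to copy.
DRY_RUN=1
push_packages
```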

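Copy the pbs_server, pbs_mom, and trqauthd systemd service files to the corresponding locations on the management node and the compute nodes. Copying them to the shared area is recommended.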

cp contrib/systemd/pbs_mom.service /usr/lib/systemd/system/pbs_mom.service
cp contrib/systemd/pbs_server.service /usr/lib/systemd/system/pbs_server.service
cp contrib/systemd/trqauthd.service /usr/lib/systemd/system/trqauthd.service
scp contrib/systemd/pbs_mom.service <mom-node>:/usr/lib/systemd/system/
scp contrib/systemd/trqauthd.service <torque-client-host>:/usr/lib/systemd/system/

Start the pbs_server, pbs_mom, and trqauthd services

qterm # stop the pbs_server instance left running by torque.setup
systemctl enable pbs_server.service
systemctl restart pbs_server.service
systemctl enable pbs_mom.service
systemctl restart pbs_mom.service
systemctl enable trqauthd.service
systemctl restart trqauthd.service

Edit /var/spool/torque/server_priv/nodes and add the compute nodes, then restart pbs_server so that it reads the new list. For example:

node006 np=2
node007 np=2
node008 np=4

systemctl restart pbs_server.service
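
A quick way to sanity-check the nodes file is to count the nodes and the job slots (np values) it declares. The following sketch uses awk; summarize_nodes and NODES_FILE are illustrative names, and a node line without np= is counted as one slot, which is Torque's default.

```shell
#!/bin/sh
# summarize_nodes FILE: print the number of nodes and the total np in FILE.
summarize_nodes() {
    awk '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        {
            nodes++
            np = 1                          # np defaults to 1 when omitted
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4)
            slots += np
        }
        END { printf "%d nodes, %d slots\n", nodes, slots }
    ' "$1"
}

NODES_FILE=${NODES_FILE:-/var/spool/torque/server_priv/nodes}
if [ -r "$NODES_FILE" ]; then
    summarize_nodes "$NODES_FILE"
fi
```

For the three example lines above this prints "3 nodes, 8 slots", which should match the np totals that pbsnodes -a reports once the MOMs are up.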

3.2 Configure the compute nodes

Install the dependencies

[root]# yum install libcgroup-tools

Install hwloc

When cgroups are enabled (recommended), hwloc version 1.9.1 or later is required.

Download hwloc-1.9.1.tar.gz from https://www.open-mpi.org/software/hwloc/v1.9.

yum install gcc make
tar -xzvf hwloc-1.9.1.tar.gz
cd hwloc-1.9.1
./configure
make
make install
echo /usr/local/lib >/etc/ld.so.conf.d/hwloc.conf
ldconfig

Install the MOM package and the client package

./torque-package-mom-linux-x86_64.sh --install
./torque-package-clients-linux-x86_64.sh --install

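Start the pbs_mom and trqauthd services

systemctl enable pbs_mom.service
systemctl start pbs_mom.service
systemctl enable trqauthd.service
systemctl start trqauthd.service

Edit /var/spool/torque/mom_priv/config so that it contains the two directives below, then restart pbs_mom:

vi /var/spool/torque/mom_priv/config
$pbsserver headnode # hostname of the host running pbs_server (node1 in this guide)
$logevent 225 # bitmap of which events to log
systemctl restart pbs_mom.service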
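
When configuring several compute nodes, the file can be written non-interactively instead of with vi. The sketch below is one possible approach; write_mom_config is an illustrative helper name, and node1 is the server hostname used elsewhere in this guide.

```shell
#!/bin/sh
# write_mom_config SERVER FILE: write a minimal pbs_mom config naming SERVER
# as the pbs_server host. Overwrites FILE.
write_mom_config() {
    # Single quotes stop the shell from expanding $pbsserver and $logevent,
    # which are Torque directive names, not shell variables.
    {
        printf '$pbsserver %s\n' "$1"
        printf '$logevent 225\n'
    } > "$2"
}

# Run as root on each compute node, then restart pbs_mom:
# write_mom_config node1 /var/spool/torque/mom_priv/config
# systemctl restart pbs_mom.service
```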

3.3 Test the configuration

Example:

# verify all queues are properly configured
> qstat -q

server: kmn

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   0 --   E R
                                               ----- -----
                                                   0     0

# view additional server configuration
> qmgr -c 'p s'
#
# Create queues and set their attributes
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = kmn
set server managers = user1@kmn
set server operators = user1@kmn
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 0

# verify all nodes are correctly reporting
> pbsnodes -a
node001
     state = free
     np = 2
     properties = bigmem,fast,ia64,smp
     ntype = cluster
     status = rectime=1328810402,varattr=,jobs=,state=free,netload=6814326158,gres=,loadave=0.21,ncpus=6,physmem=8193724kb,availmem=13922548kb,totmem=16581304kb,idletime=3,nusers=3,nsessions=18,sessions=1876 1120 1912 1926 1937 1951 2019 2057 28399 2126 2140 2323 5419 17948 19356 27726 22254 29569,uname=Linux kmn 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

# submit a basic job - DO NOT RUN AS ROOT
> su - testuser
> echo "sleep 30" | qsub

# verify jobs display
> qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.kmn            STDIN            knielson                0 Q batch

At this point the job stays queued (state Q) because no scheduler is running yet. The next step installs the Maui scheduler.

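4. Install the Maui job scheduler

The full installation procedure is lengthy, so only a summary is given here. For details, see the official documentation:

MAUI: http://docs.adaptivecomputing.com/maui/index.php

First download the package from the official site:

maui: http://www.adaptivecomputing.com/support/download-center/maui-cluster-scheduler/

The version used here is maui-3.3.1.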

Build and install

> tar -xzvf maui-3.3.1.tar.gz
> cd maui-3.3.1
> ./configure
> make
> make install

Add Maui to the PATH

[root@node1 ~]# vi /etc/profile
Add:
export PATH=/usr/local/maui/bin/:/usr/local/maui/sbin/:$PATH

[root@node1 ~]# source /etc/profile

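Configure maui.cfg. The default configuration is enough to start with; edit this file later as your requirements grow.

Start maui

To start Maui at boot, add this line to /etc/rc.local: /usr/local/maui/sbin/maui

On CentOS 7, /etc/rc.d/rc.local must also be executable (chmod +x /etc/rc.d/rc.local), otherwise it is not run at boot.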
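
With the scheduler running, a slightly fuller test job than the sleep one-liner can confirm that jobs now leave the queue. The sketch below writes an illustrative job script; the file name hello_hpc.pbs, the job name, and the resource requests are assumptions chosen for a small test, while PBS_O_WORKDIR, PBS_JOBID, and PBS_NODEFILE are variables Torque sets inside a job.

```shell
#!/bin/sh
JOB_SCRIPT=${JOB_SCRIPT:-hello_hpc.pbs}

# The quoted heredoc delimiter keeps the script text literal.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/sh
#PBS -N hello_hpc
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe

# Torque sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory so the script also runs by hand.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

echo "job ${PBS_JOBID:-interactive} running on $(uname -n)"

# Under Torque, PBS_NODEFILE lists one line per allocated slot.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated slots: $(wc -l < "$PBS_NODEFILE")"
fi
EOF

# Submit as a regular user (not root), then watch the state change from Q to R:
# qsub hello_hpc.pbs
# qstat
```

Because the #PBS lines are ordinary shell comments, the script can also be run directly with sh to check it before submitting.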
