Installing and Restarting Slurm
1. Install OpenSSL and munge
2. Install (as user caoj7)
./configure --prefix=/usr/local --sysconfdir=/usr/local/etc --enable-debug
make
sudo make install
2. slurm.conf (if it is changed, slurmctld and slurmd must be restarted)
– Use doc/html/configurator.html to create slurm.conf
– /usr/local/etc/slurm.conf (edited to set SlurmUser=caoj7 and SlurmdUser=caoj7)
– sudo scp /usr/local/etc/slurm.conf vm2:/usr/local/etc/ (and likewise for the other nodes)
– sudo chown caoj7:caoj7 /usr/local/etc/slurm.conf (and likewise on the other nodes)
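The copy step has to be repeated for every node, so it can be wrapped in a small loop. The following is only a sketch: the function name push_conf and the NODES and DRY_RUN variables are made up for illustration, and the node list vm2 to vm5 is taken from the NodeName line of the slurm.conf used in these notes. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: print (or run) the slurm.conf copy command for each compute node.
# NODES and DRY_RUN are hypothetical knobs, not part of Slurm.
CONF=/usr/local/etc/slurm.conf
NODES="${NODES:-vm2 vm3 vm4 vm5}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = run them

push_conf() {
    for node in $NODES; do
        cmd="sudo scp $CONF $node:/usr/local/etc/"
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}
```

Calling push_conf with the defaults lists one scp command per node. Setting DRY_RUN=0 before sourcing the file runs them instead.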
3. Create files and directories
– sudo touch /var/run/slurmctld.pid
  • sudo chown caoj7:caoj7 /var/run/slurmctld.pid
– sudo touch /var/run/slurmd.pid
  • sudo chown caoj7:caoj7 /var/run/slurmd.pid
  • touch /var/run/slurmd.pid
– sudo mkdir /var/spool/slurmd
  • sudo chown -R caoj7:caoj7 /var/spool/slurmd
– sudo touch /var/spool/job_state
  • sudo chown caoj7:caoj7 /var/spool/job_state
– sudo touch /var/spool/resv_state
  • sudo chown caoj7:caoj7 /var/spool/resv_state
– sudo touch /var/spool/node_state
  • sudo chown caoj7:caoj7 /var/spool/node_state
– sudo touch /var/spool/trigger_state
  • sudo chown caoj7:caoj7 /var/spool/trigger_state
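The touch and chown pairs in this step follow one pattern, so they can be generated rather than typed. This is a minimal sketch under that assumption. The helper name state_file_commands is hypothetical, and it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the setup commands from this step for one user.
# Nothing is executed here; review the output before piping it to sh.
state_file_commands() {
    user="$1"
    for f in /var/run/slurmctld.pid /var/run/slurmd.pid \
             /var/spool/job_state /var/spool/resv_state \
             /var/spool/node_state /var/spool/trigger_state; do
        echo "sudo touch $f"
        echo "sudo chown $user:$user $f"
    done
    # -p makes mkdir succeed even if the directory already exists
    echo "sudo mkdir -p /var/spool/slurmd"
    echo "sudo chown -R $user:$user /var/spool/slurmd"
}
```

For example, state_file_commands caoj7 prints the list, and state_file_commands caoj7 | sh would run it.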
4. Startup
– Master
  • slurmctld -D -vvvvvv
  • If /var/run/slurmctld.pid has been removed, re-create it with vi
– Slave
  • slurmd -D -vvvvvv
  • If /var/run/slurmd.pid has been removed, re-create it with vi
5. Errors
slurmctld reports "error: authentication: expired credential"
– The clocks on the nodes are not synchronized.
– Set the time, e.g. date -s "2012-9-3 14:27:00"
– Restart munge and slurm
If node002 cannot register with the master
– The cause might be ssh
– Try ssh to the master node (e.g., node001) from node002
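munge credentials carry a time-to-live, so a large clock difference between nodes can make a fresh credential look expired. Before resetting the time by hand, the difference can be measured. The sketch below is only illustrative: skew_seconds is a hypothetical helper, and the ssh call in the usage note assumes password-less ssh to node001.

```shell
#!/bin/sh
# Hypothetical helper: absolute difference between two epoch timestamps (seconds).
skew_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}
```

A possible use from node002 is skew_seconds "$(date +%s)" "$(ssh node001 date +%s)". A result of more than a few seconds suggests the clocks need to be synchronized.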
salloc error
[caoj7@vm2 mpi]$ salloc -N2
-bash: ./salloc: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
[caoj7@vm1 mpi]$ ldd /usr/local/bin/salloc
    linux-vdso.so.1 => (0x00007fff0ebff000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003d3f000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d6e000000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003d6dc00000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003d6d400000)
[caoj7@vm1 mpi]$ cd /lib
[caoj7@vm1 lib]$ ln -s /lib64/ld-linux-x86-64.so.2 ld-linux.so.2
However, the error occurred again after that. After removing the link with unlink, it worked correctly.
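/lib/ld-linux.so.2 is the loader requested by 32-bit x86 binaries, while the ldd output is for a 64-bit binary. One possible reading is that the ./salloc being run is not the same build as /usr/local/bin/salloc. In that case, pointing the 32-bit loader name at the 64-bit loader would not be expected to help. The sketch below checks which kind of binary a file is by reading the class byte of its ELF header. The function name elf_class is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether an ELF file is 32-bit or 64-bit.
# Byte 5 of the ELF header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
elf_class() {
    case "$(od -An -tu1 -j4 -N1 "$1" | tr -d '[:space:]')" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}
```

Comparing elf_class ./salloc with elf_class /usr/local/bin/salloc would show whether the two files differ in this respect.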
------------------------------------------------------------------
Restart
1. Start munge
[caoj7@vm5 ~]$ sudo /etc/init.d/munge start
2. Start slurmctld or slurmd
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Unable to open pidfile `/var/run/slurmd.pid': Permission denied
slurmd: slurmd started on Fri 30 Nov 2012 09:57:55 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=846
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': No such file or directory
slurmd: Slurmd shutdown completing
[caoj7@vm5 ~]$ sudo touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ sudo chown caoj7:caoj7 /var/run/slurmd.pid
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Possible corrupt pidfile `/var/run/slurmd.pid'
slurmd: slurmd started on Fri 30 Nov 2012 09:58:48 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=899
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': Permission denied
slurmd: Slurmd shutdown completing
[caoj7@vm5 ~]$ touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: slurmd started on Fri 30 Nov 2012 09:59:14 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=925
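The pidfile messages in this transcript depend on whether the file exists, whether the current user may write to it, and whether the user may write to its directory (removing a file needs write permission on the directory). A small check before starting the daemon can show which case applies. This is a sketch only, and pidfile_status is a hypothetical helper, not a Slurm command.

```shell
#!/bin/sh
# Hypothetical helper: classify the state of a pidfile for the current user.
pidfile_status() {
    dir=$(dirname "$1")
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -w "$1" ]; then
        echo "not-writable"
    elif [ ! -w "$dir" ]; then
        # the file can be written, but it cannot be removed on shutdown
        echo "ok-but-cannot-remove"
    else
        echo "ok"
    fi
}
```

For example, pidfile_status /var/run/slurmd.pid run as the Slurm user before slurmd is started.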
slurm.conf created with doc/html/configurator.html (with SlurmUser and SlurmdUser set to caoj7):
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=vm1
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=caoj7
SlurmdUser=caoj7
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=vm[2-5] CPUs=4 State=UNKNOWN
PartitionName=compute Nodes=vm[2-5] Default=YES MaxTime=INFINITE State=UP
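The NodeName and PartitionName lines use the bracket range vm[2-5] to stand for vm2, vm3, vm4 and vm5. On a working installation, scontrol show hostnames can expand such an expression. The sketch below does the same for the simple prefix[first-last] form only, as an illustration. expand_range is a hypothetical helper and does not handle comma lists or zero padding.

```shell
#!/bin/sh
# Hypothetical helper: expand a simple "prefix[first-last]" node range,
# printing one host name per line.
expand_range() {
    prefix=${1%%\[*}        # text before the opening bracket
    range=${1#*\[}          # text after the opening bracket
    range=${range%\]}       # drop the closing bracket
    first=${range%-*}
    last=${range#*-}
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$prefix$i"
        i=$(( i + 1 ))
    done
}
```

For example, expand_range 'vm[2-5]' lists the four compute nodes named in this configuration.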