Installing and restarting Slurm

1. Install OpenSSL and munge
2. Install
Install (caoj7)
./configure --prefix=/usr/local --sysconfdir=/usr/local/etc --enable-debug
make
sudo make install
2. slurm.conf (if it is revised, slurmctld and slurmd need to be restarted)

Use doc/html/configurator.html to create slurm.conf
# slurm.conf file generated by configurator.easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=vm1
#ControlAddr=
# 
#MailProg=/bin/mail 
MpiDefault=none
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818 
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=caoj7
SlurmdUser=caoj7 
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321 
SelectType=select/linear
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3 
#SlurmctldLogFile=
#SlurmdDebug=3 
#SlurmdLogFile=
# 
# 
# COMPUTE NODES 
NodeName=vm[2-5] CPUs=4 State=UNKNOWN 
PartitionName=compute Nodes=vm[2-5] Default=YES MaxTime=INFINITE State=UP


/usr/local/etc/slurm.conf (revised SlurmUser=caoj7 SlurmdUser=caoj7)
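
To confirm that the revised values really are in the file on a given node, a small helper can print a single setting. This is only a sketch: slurm_conf_get is a hypothetical name, not a Slurm tool. It matches keys case-sensitively and only when they start a line, so it does not see the extra keys on the NodeName and PartitionName lines.

```shell
#!/bin/sh
# Print the value of one slurm.conf key, ignoring commented-out lines.
# Usage: slurm_conf_get KEY [FILE]
# slurm_conf_get is a hypothetical helper name, not part of Slurm.
slurm_conf_get() {
    key=$1
    conf=${2:-/usr/local/etc/slurm.conf}
    # Match "KEY=value" at the start of a line, cut the value at the first
    # whitespace or '#', and keep the last match if the key is repeated.
    sed -n "s/^${key}=\([^[:space:]#]*\).*/\1/p" "$conf" | tail -n 1
}
```

For example, slurm_conf_get SlurmUser should print caoj7 for the file above, and slurm_conf_get SlurmdPort should print nothing because that line is commented out.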

sudo scp /usr/local/etc/slurm.conf vm2:/usr/local/etc/   (etc.)

sudo chown caoj7:caoj7 /usr/local/etc/slurm.conf (etc.)
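
The "(etc.)" above means repeating the copy and the chown for every compute node. A loop along the following lines could do that. It is a sketch under assumptions: the node list vm2 to vm5 is taken from the NodeName line, running the chown through ssh with sudo is my guess at how it was done on each node, and run and distribute_conf are hypothetical helper names. By default it only prints the commands.

```shell
#!/bin/sh
# Copy slurm.conf to each compute node and fix its ownership there.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
CONF=/usr/local/etc/slurm.conf
NODES="vm2 vm3 vm4 vm5"   # assumed from NodeName=vm[2-5]
OWNER=caoj7

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

distribute_conf() {
    for node in $NODES; do
        run sudo scp "$CONF" "$node:/usr/local/etc/"
        # Assumption: sudo works non-interactively over ssh on the node.
        run ssh "$node" sudo chown "$OWNER:$OWNER" "$CONF"
    done
}
```

Calling distribute_conf prints the eight commands. Running DRY_RUN=0 distribute_conf would execute them.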
3. Create files and directories

sudo touch /var/run/slurmctld.pid

sudo chown caoj7:caoj7 /var/run/slurmctld.pid

sudo touch /var/run/slurmd.pid

sudo chown caoj7:caoj7 /var/run/slurmd.pid

touch /var/run/slurmd.pid

sudo mkdir /var/spool/slurmd

sudo chown -R caoj7:caoj7 /var/spool/slurmd

sudo touch /var/spool/job_state

sudo chown caoj7:caoj7 /var/spool/job_state

sudo touch /var/spool/resv_state

sudo chown caoj7:caoj7 /var/spool/resv_state

sudo touch /var/spool/node_state

sudo chown caoj7:caoj7 /var/spool/node_state

sudo touch /var/spool/trigger_state

sudo chown caoj7:caoj7 /var/spool/trigger_state
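
The same file and directory creation can be collected into one function so that it is easy to repeat on every node. This is a sketch only: prepare_slurm_files and its root argument are hypothetical, added so the function can be tried in a scratch directory first, and it uses mkdir -p where the list above uses plain mkdir. It would need to run as root to act on the real /var paths.

```shell
#!/bin/sh
# Create the pid files, spool directory and state files listed above.
# Usage: prepare_slurm_files ROOT OWNER
#   ROOT  - path prefix; "" for the real system, or a scratch directory
#   OWNER - user and group to chown to (e.g. caoj7); "" skips the chown
# prepare_slurm_files is a hypothetical helper, not part of Slurm.
prepare_slurm_files() {
    root=$1
    owner=$2

    mkdir -p "$root/var/run" "$root/var/spool/slurmd" || return 1

    for f in /var/run/slurmctld.pid /var/run/slurmd.pid \
             /var/spool/job_state /var/spool/resv_state \
             /var/spool/node_state /var/spool/trigger_state; do
        touch "$root$f" || return 1
        if [ -n "$owner" ]; then
            chown "$owner:$owner" "$root$f" || return 1
        fi
    done

    if [ -n "$owner" ]; then
        chown -R "$owner:$owner" "$root/var/spool/slurmd" || return 1
    fi
}
```

For a trial run, prepare_slurm_files /tmp/slurm-test "" creates the tree under /tmp/slurm-test without changing ownership. As root, prepare_slurm_files "" caoj7 corresponds to the commands above.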
4. Startup

Master

slurmctld -D -vvvvvv

If /var/run/slurmctld.pid is removed, use vi to re-create it

Slave

slurmd -D -vvvvvv

If /var/run/slurmd.pid is removed, use vi to re-create it
5. Error
slurmctld
error: authentication: expired credential

The clocks are not in sync.

date -s "2012-9-3 14:27:00"

Restart munge and slurm
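
Before setting the date by hand, it may help to measure how far apart the clocks actually are. The function below is a sketch: clock_skew_ok is a hypothetical name, and the 60-second default tolerance is an illustrative choice, not a value taken from munge or Slurm.

```shell
#!/bin/sh
# Compare two clocks given as seconds since the epoch.
# Usage: clock_skew_ok LOCAL_TS REMOTE_TS [MAX_SECONDS]
# Prints the absolute skew and returns 0 if it is within MAX_SECONDS.
# The default of 60 seconds is an illustrative assumption.
clock_skew_ok() {
    local_ts=$1
    remote_ts=$2
    max=${3:-60}
    diff=$((local_ts - remote_ts))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    echo "skew=${diff}s"
    [ "$diff" -le "$max" ]
}
```

On a compute node this could be called as clock_skew_ok "$(date +%s)" "$(ssh vm1 date +%s)", assuming ssh to the master works without a password prompt.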
If node002 can't register with the master

This might be caused by ssh.

Try ssh masternode (e.g., node001) from node002
salloc error
[caoj7@vm2 mpi]$ salloc -N2

-bash: ./salloc: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

[caoj7@vm1 mpi]$ ldd /usr/local/bin/salloc

  linux-vdso.so.1 =>  (0x00007fff0ebff000)
  libdl.so.2 => /lib64/libdl.so.2 (0x0000003d3f000000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d6e000000)
  libc.so.6 => /lib64/libc.so.6 (0x0000003d6dc00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000003d6d400000)

[caoj7@vm1 mpi]$ cd /lib

[caoj7@vm1 lib]$ ln -s /lib64/ld-linux-x86-64.so.2 ld-linux.so.2
However, an error occurred again after that. After removing the link with unlink, it worked correctly.
------------------------------------------------------------------
Restart
1. Start munge
[caoj7@vm5 ~]$ sudo /etc/init.d/munge start
2. Start slurmctld or slurmd
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Unable to open pidfile `/var/run/slurmd.pid': Permission denied
slurmd: slurmd started on Fri 30 Nov 2012 09:57:55 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=846
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': No such file or directory
slurmd: Slurmd shutdown completing
[caoj7@vm5 ~]$ sudo touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ sudo chown caoj7:caoj7 /var/run/slurmd.pid
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Possible corrupt pidfile `/var/run/slurmd.pid'
slurmd: slurmd started on Fri 30 Nov 2012 09:58:48 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=899
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': Permission denied
slurmd: Slurmd shutdown completing
[caoj7@vm5 ~]$ touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: slurmd started on Fri 30 Nov 2012 09:59:14 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=925
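
The log above shows the pid file going missing and then having the wrong owner. A small check before starting the daemon can report which case applies. This is a sketch: pidfile_status is a hypothetical helper name, and it only tests whether the file exists and is writable by the current user. It does not look at the file's contents.

```shell
#!/bin/sh
# Classify a pid file before starting slurmd or slurmctld.
# Usage: pidfile_status PATH
# Prints one of: missing, not-writable, ok
# pidfile_status is a hypothetical helper, not part of Slurm.
pidfile_status() {
    f=$1
    if [ ! -e "$f" ]; then
        echo missing
    elif [ ! -w "$f" ]; then
        echo not-writable
    else
        echo ok
    fi
}
```

If pidfile_status /var/run/slurmd.pid prints missing or not-writable, the sudo touch and sudo chown commands from the log would be the next step before running slurmd again.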