[Rocks-Discuss] All nodes down

Juan J. Meléndez Martínez melendez at unex.es
Mon Jul 19 10:05:07 PDT 2010


Hi all

I have a Rocks 5.3 cluster with 9 compute nodes. All of them were working
fine until a power cut last Friday. Since then, all the nodes are reported as
down even though they are switched on:

[root@wotan gsm]# rocks run host 'w'
compute-0-6: down
compute-0-8: down
compute-0-3: down
compute-0-7: down
compute-0-4: down
compute-0-5: down
compute-0-0: down
compute-0-1: down
compute-0-2: down
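
On a larger cluster, the unreachable nodes could be pulled out of that output with a small filter. This is only a sketch: `down_hosts` is a made-up helper name, and it assumes every unreachable node is printed exactly as `<hostname>: down`, as in the listing above.

```shell
# Hypothetical helper: print the hosts that `rocks run host` reports as down.
# Reads the command output on stdin and keeps only lines of the exact form
# "<hostname>: down"; anything else (e.g. real output from live nodes) is ignored.
down_hosts() {
    awk -F': ' '$2 == "down" { print $1 }' | sort
}

# Intended usage on the frontend:
#   rocks run host 'w' | down_hosts
```

Counting the result (`... | down_hosts | wc -l`) gives the number of nodes that need attention.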

The ping command gives:

[root@wotan gsm]# ping compute-0-0
PING compute-0-0.local (10.1.255.254) 56(84) bytes of data.
From wotan.local (10.1.1.1) icmp_seq=17 Destination Host Unreachable
From wotan.local (10.1.1.1) icmp_seq=18 Destination Host Unreachable
From wotan.local (10.1.1.1) icmp_seq=19 Destination Host Unreachable
From wotan.local (10.1.1.1) icmp_seq=21 Destination Host Unreachable
From wotan.local (10.1.1.1) icmp_seq=22 Destination Host Unreachable
From wotan.local (10.1.1.1) icmp_seq=23 Destination Host Unreachable

--- compute-0-0.local ping statistics ---
23 packets transmitted, 0 received, +6 errors, 100% packet loss, time 22000ms
, pipe 3

However, the SGE "qstat" command does give some output:

[root@wotan gsm]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BIP   0/8/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BIP   0/0/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
all.q@compute-0-8.local        BIP   0/0/8          -NA-     lx26-amd64    au


Any ideas will be welcome!

Cheers

Juanjo


Juan J. Melendez
Associate Professor
Department of Physics
University of Extremadura
Avda. de Elvas, s/n  06006  Badajoz (Spain)
Phone: +34 924 289 655
Fax: 924 289 651
e-mail: melendez at unex.es


