[Rocks-Discuss] Nodes downed by state 'au'

Chris Dagdigian dag at sonsorol.org
Thu Mar 3 04:20:15 PST 2011


State "au" is a combination of two errors in Grid Engine, both of them 
transient in that they will clear themselves automatically when the 
underlying conditions are resolved.

The errors are short codes for two states:

   (a)larm, (u)nreachable

The (u) comes from the node being unreachable via the network: when the 
sge_execd is not heard from by the sge_qmaster within the configured 
timeout window, the node is placed into (u)nreachable state.

The (a) comes along for the ride because, in the default install of 
SGE, the scheduler is configured to assume a load average of 99.99 for 
nodes that do not report actual load values. The assumed load average 
is a safety valve that prevents the scheduler from considering the node 
for more work.
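
If you want to see the timers involved and the scheduler's view of the 
node, a couple of commands help. A minimal sketch - run from the head 
node; the -explain option only exists on reasonably recent SGE releases:

  # execd report interval and the window after which qmaster marks a
  # host unreachable
  qconf -sconf | egrep 'load_report_time|max_unheard'

  # show queue instance states; affected hosts will carry the 'au' flag
  qstat -f

  # ask SGE why the (a)larm state was set
  qstat -f -explain a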

It is very common to see state (au) with Grid Engine whenever:

  - A node is down
  - A node is hung/frozen
  - There are network problems

So the normal procedure in this case is a hard reboot or even a ROCKS 
reinstall of the node.
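
If the node is truly wedged, you can usually kick off the reboot or 
reinstall without leaving the head node. A rough sketch, assuming a 
Rocks 5.x command set with compute-0-11 as the problem node - note that 
if ssh to the node is refused (as in your case) these won't work and 
you are down to IPMI or a physical power cycle:

  # try a plain remote reboot first
  rocks run host compute-0-11 "reboot"

  # or flag the node for a full reinstall on its next boot, then reboot
  rocks set host boot compute-0-11 action=install
  rocks run host compute-0-11 "reboot"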

What you are seeing is less common - an application that puts nodes into 
'au' state. State 'au' is far more common when there is a hardware, OS 
or network issue ...

I see application-triggered 'au' states very rarely, and usually only 
when a job is hammering an NFS server really hard or perhaps doing a 
final burst of data collection or analysis. Sometimes the job ties up 
so many resources on the node that it goes 'au' for a short period of 
time. When we know it's just the job, we let it do its thing and wait 
for the errors to clear themselves.

Not sure what is up with your C++ code - have you ever let it run to 
completion to see if the 'au' state goes away?

You really need to figure out what the C++ code is doing. Is it 
actually crashing or hanging a node, or is it just temporarily tying up 
enough resources that the node is hard to reach via the network? Do the 
'au' states only happen on certain machines over and over again, or do 
they show up randomly across the cluster?
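
One quick way to answer the 'same nodes or random nodes' question is to 
record which hosts show the (u) flag over time while the job runs. A 
rough sketch - the awk field positions assume the usual qstat -f column 
layout, so double-check against your own output:

  # log which queue instances show a 'u' (unreachable) flag, once a minute
  while true; do
      date
      qstat -f | awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ {print "  " $1}'
      sleep 60
  done >> au_history.log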

Best debug tips:

  - Let the 'au' nodes stay up to determine whether the issue is 
transient or not. Maybe your app is still running but just maxing out 
resources on the compute nodes - are its data/output files still 
growing? Can you find any proof that it actually crashed or stopped?

  - Check the STDOUT and STDERR files for the app on the nodes that went 
'au' to see if there are any useful messages

  - Check the SGE qmaster and scheduler logs for messages (example 
paths are sketched after this list)

  - Check the SGE spool messages file for the nodes that went 'au'

  - Check the system /var/log/ logs on the nodes that went 'au'

  - Attach a console to a node to see if you can see the issue 'live'

  - Determine if the 'au' happens reliably on certain nodes or randomly 
throughout the cluster.
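
For the log-checking tips above, here is roughly where things live on a 
stock Rocks/SGE install. This is a sketch: the paths assume SGE in 
/opt/gridengine with classic spooling and a cell named 'default', and 
compute-0-11 is just an example hostname - adjust for your layout.

  # qmaster and scheduler messages (run on the head node)
  less /opt/gridengine/default/spool/qmaster/messages
  less /opt/gridengine/default/spool/qmaster/schedd/messages

  # execd spool messages for a node that went 'au'
  less /opt/gridengine/default/spool/compute-0-11/messages

  # system logs on the node itself, once it is reachable again
  ssh compute-0-11 'tail -n 200 /var/log/messages'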





Mahmood, Nasir wrote:
> Hi All,
>
> We are having some state 'au' issue on our cluster.
>
> In fact, the problem is being caused by a nasty piece of C++ code. Whenever we run that code it brings some nodes into 'au' state. The nodes with 'au' state become inaccessible through ssh, giving the following message:
>
> ssh: connect to host compute-0-11 port 22: Connection refused
>
> And the commands like
>
> rocks run host "/opt/gridengine/bin/lx26-amd64/sge_execd"
>
> complain that the compute-0-11 is down.
>
> Being newbies, the only solution we know so far is manual/hard start of 'au' nodes.
>
> Can someone suggest elegant way to bring back the 'au' nodes while sitting at head node.
>
> Thanks a lot in advance,
>
> Nasir

