[Rocks-Discuss] Nodes downed by state 'au'
dag at sonsorol.org
Thu Mar 3 04:20:15 PST 2011
State "au" is a combination of two errors in Grid Engine, both of them
transient in that they will clear themselves automatically when the
underlying conditions are resolved.
The errors are short codes for two states:
The (u) comes from the node being unreachable via the network: when the
sge_execd is not heard from by the sge_qmaster within the configured
timeout window, the node is placed into the (u)nreachable state.
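If you want to see the timeouts involved, they live in the global host
configuration (this assumes a standard install with the SGE binaries on
your PATH; on Rocks they are under /opt/gridengine):

    # load_report_time is how often each sge_execd reports in;
    # max_unheard is how long sge_qmaster waits before flagging
    # the host as (u)nreachable.
    qconf -sconf | egrep 'load_report_time|max_unheard'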
The (a) comes along for the ride: in the default install of SGE, the
scheduler assumes a load average of 99.99 for nodes that do not report
actual load values. The assumed load average is a safety
valve that prevents the scheduler from considering the node for more work.
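You can watch this from the head node: queues on unreachable hosts show
up with the assumed load tripping the alarm threshold. Something like
this should work (the -qs filter and -explain flag are standard qstat
options, but double-check them against your SGE version):

    # Show full queue status for queues in (a)larm or (u)nreachable
    # state, with the reason the alarm threshold was tripped.
    qstat -f -qs au -explain a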
It is very common to see state (au) with Grid Engine whenever:
- A node is down
- A node is hung/frozen
- There are network problems between the node and the qmaster
So the normal procedure in this case is a hard reboot or even a ROCKS
reinstall of the node.
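If you decide a node is truly wedged and want to handle it entirely from
the head node, the usual Rocks recipe is to flag the node for reinstall
and then power-cycle it out-of-band. The ipmitool line below is only an
assumption for illustration - the BMC hostname and credentials will be
whatever your hardware uses:

    # Tell Rocks to kickstart/reinstall this node on its next boot.
    rocks set host boot compute-0-11 action=install
    # Then power-cycle the node out-of-band, since ssh is refused.
    # (hypothetical BMC address/credentials - substitute your own)
    ipmitool -H compute-0-11-bmc -U admin -P changeme chassis power cycle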
What you are seeing is less common - an application that puts nodes into
'au' state. State 'au' is far more common when there is a hardware, OS
or network issue ...
I see application-induced 'au' states very rarely, and usually only when
a job is hammering an NFS server really hard or perhaps doing a final
set of data collection or analysis. Sometimes the job uses up so many
resources on the node that it goes 'au' for a short period of time. When
we know it's just the job, we let it do its thing and wait for the
errors to clear.
Not sure what is up with your C++ code; have you ever let it run to
completion to see if the 'au' state goes away?
You really need to figure out what the C++ code is doing: is it actually
crashing or hanging a node, or is it just temporarily tying up enough
resources that the node is hard to reach via the network? Do the 'au'
states only happen on certain machines over and over again, or do they
show up randomly across the cluster?
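One crude way to answer that from the head node is to sample the queue
states for a while and keep a history (the log file name here is made
up, put it wherever you like):

    # Record which queues are in 'au' once a minute; after a few job
    # runs the pattern (same nodes vs. random nodes) should be obvious.
    while true; do
        date >> /tmp/au-history.log
        qstat -f -qs au >> /tmp/au-history.log
        sleep 60
    done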
Best debug tips:
- Let the 'au' nodes stay to determine whether the issue is transient or
not. Maybe your app is still running but just maxing out resources on
the compute nodes; are its data/output files still growing? Can you
find any proof that it actually crashed or stopped?
- Check the STDOUT and STDERR files for the app on the nodes that went
'au' to see if there are any useful messages
- Check the SGE qmaster and scheduler logs for messages
- Check the SGE spool messages file for the nodes that went 'au'
- Check the system /var/log/ logs on the nodes that went 'au'
  (typical paths for these three logs are sketched after this list)
- Attach a console to a node to see if you can see the issue 'live'
- Determine if the 'au' happens reliably on certain nodes or randomly
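For the log checks above, the usual locations on a Rocks + SGE install
look like this (paths assume SGE_ROOT=/opt/gridengine and the 'default'
cell; the per-node execd spool is typically local to each node, so read
it on the node itself or over the console if ssh is refused):

    tail -100 /opt/gridengine/default/spool/qmaster/messages         # qmaster log
    tail -100 /opt/gridengine/default/spool/qmaster/schedd/messages  # scheduler log
    tail -100 /opt/gridengine/default/spool/compute-0-11/messages    # execd log for one node
    grep -i -e error -e warn /var/log/messages                       # system log, on the node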
Mahmood, Nasir wrote:
> Hi All,
> We are having a state 'au' issue on our cluster.
> In fact, the problem is being caused by a nasty piece of C++ code. Whenever we run that code it puts some nodes into the 'au' state. The nodes in the 'au' state become inaccessible through ssh, giving the following message:
> ssh: connect to host compute-0-11 port 22: Connection refused
> And commands like
> rocks run host "/opt/gridengine/bin/lx26-amd64/sge_execd"
> complain that compute-0-11 is down.
> Being newbies, the only solution we know so far is a manual/hard restart of the 'au' nodes.
> Can someone suggest an elegant way to bring back the 'au' nodes while sitting at the head node?
> Thanks a lot in advance,