[Rocks-Discuss] "Some compute nodes not accepting jobs"

Mike Hanby mhanby at uab.edu
Mon Jul 19 12:27:15 PDT 2010


Sometimes, if a job fails in a way that makes SGE think the node might be at fault, SGE will put the queue on that node into an error state.

'qstat -f' will show any queues marked as errored (an E in the states column).
'qstat -g c' will give you a quick summary of total slots, slots in use, slots offline/errored, etc.
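
To pull just the errored queue instances out of 'qstat -f', a small filter along these lines may help. This is only a sketch: it assumes the SGE 6.2 column layout (queuename, qtype, resv/used/tot., load_avg, arch, states), so the state flags are taken from the sixth field, and the function name errored_queues is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads 'qstat -f' output on stdin and prints the
# queue instances whose states column contains E (error).
# Assumption: the states column is the 6th whitespace-separated field;
# other SGE versions may lay the columns out differently.
errored_queues() {
    # Queue instance lines are the ones whose first field contains '@'
    # (e.g. all.q@compute-0-0.local); job lines and separators are skipped.
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage on the head node:
#   qstat -f | errored_queues
```

If the function prints nothing, no queue instance is currently flagged with E.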

You may be able to find some information in /opt/gridengine/default/spool/qmaster/messages

It's a good idea to run the 'qstat -g c' command periodically. It can help head off some support calls, especially if you have some users who are node watchers ;-)

Also, check 'qhost' from time to time to make sure none of your nodes are overloaded (e.g. by jobs behaving badly); it displays load, memory used, and swap used for each node.
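
As a rough illustration of the 'qhost' check, the sketch below flags hosts whose load average is higher than their CPU count. It assumes the column order HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS; the "load greater than NCPU" threshold and the function name overloaded_hosts are my own choices, not anything SGE defines.

```shell
#!/bin/sh
# Hypothetical helper: reads 'qhost' output on stdin and prints hosts
# whose load average exceeds their CPU count.
# Assumption: NCPU is the 3rd column and LOAD is the 4th.
overloaded_hosts() {
    # Only rows with numeric NCPU and LOAD are considered, which skips the
    # header, the separator line, the 'global' row, and hosts reporting '-'.
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ && $4 + 0 > $3 + 0 {
        print $1, "load=" $4, "ncpu=" $3
    }'
}

# Usage on the head node:
#   qhost | overloaded_hosts
```

A host that shows up here is not necessarily broken, but it is worth a look to see which job is responsible.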

Mike
-----Original Message-----
From: npaci-rocks-discussion-bounces at sdsc.edu [mailto:npaci-rocks-discussion-bounces at sdsc.edu] On Behalf Of Hobbick, Christopher Charles
Sent: Monday, July 19, 2010 12:13 PM
To: Discussion of Rocks Clusters
Subject: Re: [Rocks-Discuss] "Some compute nodes not accepting jobs"

Thanks. That worked.  Any idea what would cause that to happen?

-----Original Message-----
From: npaci-rocks-discussion-bounces at sdsc.edu [mailto:npaci-rocks-discussion-bounces at sdsc.edu] On Behalf Of Nick Holway
Sent: Monday, July 19, 2010 12:58 PM
To: Discussion of Rocks Clusters
Subject: Re: [Rocks-Discuss] "Some compute nodes not accepting jobs"

Can you run a quick "qstat -f" and see whether any queues are flagged with errors (i.e. have an E on the right)? You can clear the error on a single queue with "qmod -c all.q@compute-x-x" (all.q being the default queue), or "qmod -c \*" will clear the errors from all queues.
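
If you would rather review what is about to be cleared, a dry-run wrapper like the sketch below might be useful. It takes a list of queue instance names on stdin and only prints the corresponding 'qmod -c' commands; the function name clear_commands is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads queue instance names (one per line) on stdin
# and prints the 'qmod -c' command that would clear each one.
# Nothing is executed here; pipe the output to sh once the list looks right.
clear_commands() {
    while IFS= read -r qi; do
        # Skip blank lines so no empty 'qmod -c' command is emitted.
        if [ -n "$qi" ]; then
            printf "qmod -c '%s'\n" "$qi"
        fi
    done
}

# Usage:
#   printf 'all.q@compute-0-0.local\n' | clear_commands
#   printf 'all.q@compute-0-0.local\n' | clear_commands | sh
```

Before clearing anything, "qstat -f -explain E" should show the reason SGE recorded for each error state, which is worth noting down first.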

Nick

On 19 July 2010 17:34, Hobbick, Christopher Charles <chobbick at iupui.edu> wrote:
> SGE
>
> -----Original Message-----
> From: npaci-rocks-discussion-bounces at sdsc.edu [mailto:npaci-rocks-discussion-bounces at sdsc.edu] On Behalf Of Bart Brashers
> Sent: Monday, July 19, 2010 12:12 PM
> To: Discussion of Rocks Clusters
> Subject: Re: [Rocks-Discuss] "Some compute nodes not accepting jobs"
>
>
> SGE or Torque/Maui?
>
> B
>
>> I have a 5.3 cluster with 6 nodes.  I've had no problems with anything
>> until last week, when a couple of jobs got stuck on 2 of the nodes.  A
>> user had specified in his job to run one job on compute-0-0 and the
>> other on compute-0-1 (he's done this before with no problems).  After
>> the jobs never got submitted, the nodes would no longer accept any
>> jobs.  I've tried restarting the nodes, restarting the sge service on
>> the head node, and rebooting the whole cluster, but nothing seems to
>> work.  The other nodes accept jobs just fine, just not compute-0-0 and
>> compute-0-1.
>>
>> Thanks
>> Chris
>

