[Rocks-Discuss] GPUs with Torque

Jonathan Gough jonathan.gough at liu.edu
Tue Mar 19 06:10:18 PDT 2013


Roy,


At this point I am prioritizing ease of use over functionality.


**Simplest case scenario:

I would like my students to be able to submit a job specifying the resource as

-l nodes=1:gpus=1

and I would like that to work on a node that has either 1 or 2 GPUs
(the cluster is not homogeneous).
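
For reference, a minimal job script using that request might look like
the following (the executable name is a placeholder, and $PBS_GPUFILE
is only populated when Torque's GPU support is actually active):

    #!/bin/bash
    #PBS -l nodes=1:gpus=1
    #PBS -l walltime=01:00:00
    # Torque lists the assigned GPU(s) in the file named by $PBS_GPUFILE,
    # one line per GPU, in the form <hostname>-gpu<index>.
    cd $PBS_O_WORKDIR
    ./my_gpu_program    # placeholder for the actual GPU code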


**Better case scenario:

I would like the queue to be able to distinguish between the two GPUs,
simply meaning the user doesn't need to worry about accidentally
submitting 2 jobs to the same GPU on the same node.
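
As I understand it, when Torque assigns GPUs it records them in
$PBS_GPUFILE, so a job could in principle pin itself to its assigned
device along these lines (an untested sketch):

    # Lines in $PBS_GPUFILE look like "compute-0-0-gpu1"; strip everything
    # up to the index and expose only that device to the CUDA runtime.
    GPU_ID=$(sed 's/^.*-gpu//' "$PBS_GPUFILE" | head -1)
    export CUDA_VISIBLE_DEVICES=$GPU_ID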


**Best case scenario (I know it's possible, but we don't really need this):

Torque can report specific details on the use and performance of
each GPU. I believe the basic/simplest setup uses nvidia-smi to gather
the data (gpu_status -
http://docs.adaptivecomputing.com/torque/3-0-5/3.7schedulinggpus.php).
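
If Torque were built with NVML support, those per-GPU details should
show up in the node status; something like this ought to display them
(the node name is a placeholder):

    pbsnodes compute-0-0 | grep gpu_status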


Using the default setup in Rocks 6.1, GPUs are listed by pbsnodes but
are not automatically identified. They can be set manually (by editing
the nodes file) and are then reported by pbsnodes, but when a job is
submitted it is held, with "NoResources available" as the error.
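
For the record, the manual edit I mean is adding a gpus= count to each
entry in the server's nodes file, along these lines (node names and
counts are placeholders, and the path may differ on a Rocks install):

    # /opt/torque/server_priv/nodes
    compute-0-0 np=8 gpus=2
    compute-0-1 np=8 gpus=1

pbs_server then has to be restarted to pick up the change.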


Even just getting the simplest case scenario to work would be great...
The other alternative is to switch to SGE and follow what Gowtham
described, but that entails reworking how all the other jobs are
run/scheduled and re-educating the students.


Admittedly, I have no conception of how one goes about building a
roll. Is it even possible to create one such that the
--enable-nvidia-gpus flag is used during configure?
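
For context, the flag in question is passed at Torque's configure step,
roughly like this (assuming the NVML headers are installed where
configure can find them):

    ./configure --enable-nvidia-gpus
    make
    make install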


Again, thanks for the willingness to help out a newbie (chemist)!

Jonathan

No sweat.

I'm a bit curious about what you need GPU support for.  How do you want to run
your jobs on the GPU-enabled nodes?  From a Torque perspective you can do
whatever you like within the job, so I must admit I have never really
understood what GPU support within a batch-system context really means.  We
have a few GPU nodes in our cluster, and all we have done is to create a
separate queue for the nodes in question.
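
For what it's worth, such a queue can be set up along these lines (the
queue name "gpu" and the node property are just examples):

    qmgr -c "create queue gpu queue_type=execution"
    qmgr -c "set queue gpu resources_default.neednodes = gpu"
    qmgr -c "set queue gpu enabled = true"
    qmgr -c "set queue gpu started = true"

    # Then tag the GPU nodes with the matching property in the nodes file:
    # compute-0-0 np=8 gpus=2 gpu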

r.




On Mon, Mar 18, 2013 at 5:44 PM, Alex Chekholko <chekh at stanford.edu> wrote:

> The easy way to check is to see if you have a process called 'maui'
> running or a process called 'pbs_sched' running.
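> For example, something like this should show which one is running:
>
>     ps -e | grep -E 'maui|pbs_sched'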
>
> The latter is the regular FIFO TORQUE scheduler, IIRC.
>
> And for Maui, there's likely a maui.conf somewhere with the scheduler
> policy definitions.
>
> Looks like a Google search for "maui pbs_sched" produces lots of relevant
> links, e.g.
> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2011-April/052300.html
>
>
> On 03/18/2013 09:22 AM, Jonathan Gough wrote:
>
>> I believe that one major disconnect for me is
>> that I don't fully understand what maui is and if I am actually using
>> it or not.
>>
>
> --
> Alex Chekholko chekh at stanford.edu
>

