Notes on Torque
Latest revision as of 12:08, 2 November 2015
Overview
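The CSML uses Torque with Maui for most of its job scheduling. Please see the Cambridge Dept. of Chemistry's user guide (http://www.ch.cam.ac.uk/computing/maui-and-torque-introduction) for information on submitting and checking jobs, and the associated admin guide (http://www.ch.cam.ac.uk/computing/maui-administration) for more job scheduling tools.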
General usage
Special usage notes
- qb is a succinct queue monitoring script that tallies each user's running and queued jobs, the total cores used and free, and any down nodes. The script was originally written by Daniel Sinkovits and should be available to copy from another user's $HOME/bin folder to your own. Maintenance consists of keeping the username list up to date and adjusting the variable node_total (currently 37) if a node goes down permanently.
- If qb indicates a node is down, check Ganglia on Minotaur or Hydra to find which node it is (down nodes are in red at the top of the list). Grep the output of qstat to find the jobs running on any down nodes with
qstat -f | grep -B 3 [node_ID]
where [node_ID] is the name of the down node, e.g., h036 on Hydra. Make sure the owners of all jobs on the node have a chance to take note of which of their jobs went down and then restart the down nodes using Microway control.
- Change the total cpu time allotted to a job via
qalter -l cput=[new_cput] [job_id]
where [new_cput] is the desired amount of CPU time and [job_id] is the job's ID number. To change several jobs at once, use `seq [min_job_id] [max_job_id]` in place of [job_id]; this changes all jobs from [min_job_id] through [max_job_id]. Lowering a job's CPU time requirement can shorten its wait in the queue, because the scheduler is more likely to be able to use it for backfilling. Increasing the CPU time requirement can ensure that a job is able to finish properly (and can be done even while the job is running), but requires root permission.
- Move a queued (i.e., waiting) job to a different queue via
qmove [destination] [job_id]
where [destination] is the new queue (either 'fast' or 'default' for our system) and [job_id] is the job ID.
- Delete a sequence of jobs via
qdel `seq [min_job_id] [max_job_id]`
where [min_job_id] is the first job ID of the sequence of jobs you want to delete and [max_job_id] is the last one.
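The notes above can be collected into two small shell helpers. This is only a sketch: the function names jobs_on_node and preview_range_cmd are made up for illustration, and the first helper assumes that qstat -f prints each job as a "Job Id:" line followed by indented "Job_Owner = ..." and "exec_host = ..." fields, each on a single line. If your Torque version wraps long exec_host lines, the match may need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers; names and the assumed qstat -f layout are illustrative only.

# List "job_id owner" for every job whose exec_host mentions the given node.
# usage: qstat -f | jobs_on_node h036
jobs_on_node() {
    awk -v node="$1" '
        /^Job Id:/        { id = $3 }      # remember the current job ID
        /^ *Job_Owner = / { owner = $3 }   # remember the current job owner
        /^ *exec_host = / {                # running jobs list their hosts here
            if (index($3, node "/") > 0) print id, owner
        }
    '
}

# Print (do not run) a command applied to a range of job IDs, so the
# `seq` expansion can be checked before it is used with qdel or qalter.
# usage: preview_range_cmd <min_job_id> <max_job_id> <command...>
preview_range_cmd() {
    min=$1
    max=$2
    shift 2
    # Unquoted on purpose: each job ID from seq becomes its own argument.
    echo "$@" $(seq "$min" "$max")
}

# Example calls (job IDs and the cput value are placeholders):
#   qstat -f | jobs_on_node h036
#   preview_range_cmd 1201 1204 qdel
#   preview_range_cmd 1201 1204 qalter -l cput=48:00:00
```

Previewing the expanded command first is a cheap safeguard, since a mistyped range passed to qdel removes every job in it.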