Hardware: Difference between revisions

From csml-wiki.northwestern.edu
=== Desktop machines ===


All desktop machines run [http://www.opensuse.org OpenSuSE].
* [[Installation instructions for OpenSuSE 13.1]].
* [[Installation instructions for OpenSuSE Leap 42.1]].
* [[Installation instructions for OpenSuSE Leap 15.0]].
* [[Installation instructions for OpenSuSE Leap 15.1]].


[[Etiquette: Running Nice on other desktop machines]]

=== Clusters ===

==== Minotaur ====
* 38 nodes, each containing two 10-core processors (760 cores total). 16 GB memory per node.<br>Processor type: Intel Xeon E5-2630, 2.2 GHz.
* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Maui. [[Notes on Torque]].

==== Hydra ====
* 60 nodes, each containing two 6-core processors (720 cores total). 12 GB memory per node.<br>8 nodes (queue "fast", nodes h001-h008) have Intel Xeon X5690 3.47 GHz processors.<br>52 nodes (queue "default", nodes h009-h060) have Intel Xeon E5645 2.40 GHz processors.
* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Maui. [[Notes on Torque]].
* [[General Usage of Hydra]]
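
As a rough illustration of how the two Hydra queues are selected, the following sketch prints a minimal Torque job script that requests one full node (12 cores, matching the two 6-core processors per node). The job name, walltime, and program name are placeholders that are not taken from this page; [[Notes on Torque]] remains the authoritative reference.

```shell
# Print a minimal Torque job script for one full Hydra node.
# $1: queue name ("default" or "fast")
# $2: command to run (hypothetical placeholder, e.g. ./my_program)
hydra_job_script() {
    cat <<EOF
#!/bin/sh
#PBS -N example_job
#PBS -q ${1:-default}
#PBS -l nodes=1:ppn=12
#PBS -l walltime=01:00:00
cd "\$PBS_O_WORKDIR"
${2:-./my_program}
EOF
}
```

Possible usage: <tt>hydra_job_script fast ./my_program > job.sh</tt>, followed by <tt>qsub job.sh</tt>.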

==== Quest ====
* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Moab. [[Notes on Torque]].
* [[General Usage of Quest]]

=== Disk space, backups, and RAID storage ===

==== Disk space allocations and nightly backups ====

Each user has a home directory located on ''ariadne''. This home directory is exported to all desktop machines, so you see the same home filesystem on each machine. The drive is protected against hardware failure via a [http://en.wikipedia.org/wiki/RAID_1#RAID_1 RAID-1] setup. In addition, each night all new or modified files on /home are written to tape (the tape drive is located in ariadne). It is therefore important not to store temporary data in your home folder, as it would quickly fill up the tape. Since users tend to forget this, a quota system has been enabled on ariadne, restricting each user to 15 GB. To check how much space you are using, log on to ariadne and issue the command
<pre>
quota -s
</pre>
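
If <tt>quota -s</tt> shows that you are close to the 15 GB limit, a small helper along the following lines may help locate what is taking up the space. This is only a sketch using standard <tt>du</tt>, <tt>sort</tt>, and <tt>head</tt>; the function name is made up for this example and is not part of the lab setup.

```shell
# Print the N largest entries directly under a directory, sizes in KB.
# Defaults: your home directory, 10 entries. Hidden files are not included.
largest_entries() {
    du -sk -- "${1:-$HOME}"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

For example, <tt>largest_entries ~ 5</tt> lists the five largest items in your home directory, largest first.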

In addition, each user has significant additional storage on the scratch partitions. These drives are located in the different desktop machines and protected via RAID-1, but backups are your own responsibility. Note that these partitions are generally only mounted on the desktop machine that contains the corresponding drives. If you need a partition to be exported to a different machine, please ask.

==== Synology diskstation ====

Pergamon is an 18 TB RAID 6 diskstation. [[Synology diskstation|Detailed usage instructions.]]

==== Changing the nightly backup tape ====

# Press eject button on tape drive in ariadne.
# Take the tape cartridge out of the drive and put it in its box (should be on top of ariadne). Label the box. Give to Erik.
# Insert the cleaning tape (on top of ariadne). It runs for less than a minute and then ejects automatically.
# Put cleaning tape back in box on top of ariadne.
# Insert new DDS tape (find in cabinet). Leave empty box on top of ariadne.
# Erik: Update settings in /usr/local/lib/backup, namely ''position'' and ''tapenumber''; update logfile.

==== Recovering data from the nightly backup tape ====

Log files of all nightly backup tapes are located on ariadne, in /usr/local/lib/backup. For privacy reasons, these logfiles are only accessible to root. Once the proper file to be recovered has been identified, insert the corresponding tape into the drive on ariadne and follow these steps (all to be executed as root):

# <tt>cd /</tt><br>(if you change to a different directory, the recovered file will be placed relative to this directory)
# <tt>/usr/local/bin/tape-rewind</tt><br>(or <tt>mtst -f /dev/nst0 rewind</tt>)
# <tt>mtst -f /dev/nst0 fsf <position></tt><br>(see the contents file in /usr/local/lib/backup for the position number)
# <tt>tar xzvf /dev/nst0 <full_file_name_without_leading_slash></tt><br>This step won't work unless you omit the leading slash; also note that you can specify multiple files, separated by spaces. The 'z' option is necessary because all nightly backups are compressed. For wildcards, use --wildcards and escape '*' and '?'. For example: <tt>tar -x --wildcards -zvf /dev/nst0 \*datafiles\*</tt>
# <tt>/usr/local/bin/tape-rewoffl</tt><br>(or <tt>mtst -f /dev/nst0 rewoffl</tt>)
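
The steps above could be wrapped in a small script, roughly as sketched below. The function names and the <tt>DRY_RUN</tt> switch are inventions for this example; only the <tt>mtst</tt> and <tt>tar</tt> invocations come from the steps listed here. By default the sketch only prints the commands, so the sequence can be reviewed before it is run as root with <tt>DRY_RUN=0</tt>.

```shell
# Sketch of the recovery steps above; run as root on ariadne.
# With DRY_RUN=1 (the default here) the commands are only printed.
TAPE=/dev/nst0

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restore_from_tape POSITION FILE...
# The body runs in a subshell, so the "cd /" does not affect the caller.
restore_from_tape() (
    # Refuse to run without file names: tar would otherwise extract everything.
    [ "$#" -ge 2 ] || { echo "usage: restore_from_tape POSITION FILE..." >&2; exit 1; }
    position=$1
    shift
    # Rebuild the argument list with the leading slash removed from each name.
    for f in "$@"; do
        set -- "$@" "${f#/}"
        shift
    done
    run cd / &&
    run mtst -f "$TAPE" rewind &&
    run mtst -f "$TAPE" fsf "$position" &&
    run tar xzvf "$TAPE" "$@" &&
    run mtst -f "$TAPE" rewoffl
)
```

The position argument is the number found in the contents file in /usr/local/lib/backup. File names may be given with or without the leading slash, since the sketch strips it before calling <tt>tar</tt>.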

==== Archiving data using the LTO tape drive ====

==== Checking RAID status ====

<ul>

<li><span id="hydra">Hydra</span>
<ul>
<li>OS is on software RAID (which spans /dev/sda and /dev/sdb). An overview is obtained via
<pre>cat /proc/mdstat</pre>
Detailed information via
<pre>
mdadm --detail /dev/mdX
</pre>
where X = 1, 5, 6, 7. Also see [[Setting up e-mail notifications for Linux Software RAID]].</li>
<li>/home: 11 TB RAID 6. Check via
<pre>
opera http://127.0.0.1:81
</pre>
If this doesn't work, restart the Areca HTTP server via the /etc/rc.d/arecaweb script. It is also possible to interrogate the controller from the command line via 'cli64'. Useful commands within cli64 include 'vsf info' (volume set information) and 'disk info'.
</li>
<li>/archive: 80 TB RAID ([http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_60_.28RAID_6.2B0.29 RAID 60]). Check via
<pre>
opera http://127.0.0.1:82
</pre>
Just as for /home, it is possible to use the 'cli64' command-line interface. However, to switch this to the proper controller, first use 'set curctrl=2'.
</li>
</ul>
</li>

<li>Minotaur<br>RAID-6 controller with 12 drives (incl. one hot spare). Web interface. Log in to the head node and use
<pre>
opera http://172.16.0.101
</pre>
RAID status is visible on the top line of the Raid Set Hierarchy table, under Volume State. For drive stability click any channel and find the SMART Attributes at the bottom of the page. Each has two values, Attribute and Threshold (Threshold is in parentheses). An Attribute value lower than Threshold indicates an unstable drive.
</li>

<li>Ariadne<br>RAID-5 controller with 4 drives. Status can be checked by interrogating the controller:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | less
</pre>
In the 'Device Present' section, it is reported if any drives are critical or have failed, and what the state of the RAID is. More detailed information can also be found via
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDPDInfo -aAll | less
</pre>
Directly at the beginning (under 'Adapter #0') it should report 'State: Optimal'.

To start a volume consistency check:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -Start -LALL -aAll
</pre>
and to monitor the progress of this check:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -ShowProg -LALL -aAll
</pre>
or, for continuous display:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -ProgDsply -LALL -aAll
</pre>
</li>

<li>Desktop machines, except pelops<br>Hardware RAID-1. The RAID status is reported upon reboot of a machine. Press Ctrl-C (when prompted) to enter the configuration utility. From within Linux, use (as root):
<pre>
mpt-status -i 0
mpt-status -i 2
</pre>
The second command only applies to machines with a second set of hard drives (achilles, agamemnon, nestor, poseidon).<br>
To allow regular users to verify the RAID status, the <tt>mpt-status</tt> command has been added to <tt>sudo</tt>:
<pre>
sudo mpt-status -i 0
sudo mpt-status -i 2
</pre>
</li>

<li>Pelops: Software RAID (for OS and scratch partitions). See [[#hydra|Hydra]].</li>

</ul>

==== Identifying a failed drive ====

Even if you know that a specific device (e.g., /dev/sdc) has failed, it is not always obvious which physical drive inside the machine this is. For this purpose, it is helpful to identify all other (still functional) drives from the command line. One way to do this is to read all drive serial numbers. For example, find the serial number of /dev/sda:
<pre>
udevadm info --query=all --name=/dev/sda | fgrep SERIAL
</pre>
Having the serial numbers allows you to connect device names to physical drives.
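
To read the serial numbers of all drives in one go, a loop along these lines might be used. It is a sketch that assumes udev sets the <tt>ID_SERIAL_SHORT</tt> property for the drives in question, which is not guaranteed for every device. Drives behind a hardware RAID controller may show up only as a single logical volume, so this is mostly useful on machines with software RAID. The function names are made up for this example.

```shell
# Read "KEY=value" udev properties on stdin and print the short serial number.
extract_serial() {
    sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Print "device serial" for every disk the kernel exposes as /dev/sdX.
list_drive_serials() {
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue
        serial=$(udevadm info --query=property --name="$dev" | extract_serial)
        printf '%s %s\n' "$dev" "${serial:-unknown}"
    done
}
```

Running <tt>list_drive_serials</tt> prints one line per device. The serial number of the failed drive is then the one printed on a physical drive label that does not appear among the working devices.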

=== Printers ===
There are two black-and-white laser printers (PS1 and PS2) in the lab, both supporting double-sided printing. For network printing, use luijten-ps1.ms.northwestern.edu and luijten-ps2.ms.northwestern.edu, respectively. For OS X, choose "HP JetDirect" as the protocol.
You can also print to luijten-ps1 from your personal laptop. To add the printer, go to add-printer > network printer and enter luijten-ps1.ms.northwestern.edu as the address/hostname. On Mac, choose AirPrint > LPD protocol. On Windows, use the TCP/IPP protocol. The printer driver is the Brother-HL-L2300 series, and the printer name is luijten-ps1.

=== Scanner ===

=== UPS ===

All our UPS units are manufactured by APC, and supported via apcupsd. Installation & configuration instructions:

<ul>

<li>Make sure the <tt>apcupsd</tt> package is installed; see [[Installation instructions for OpenSuSE 13.1]].</li>

<li>Connect UPS unit to USB port of the corresponding machine.</li>

<li>In <tt>/etc/apcupsd/apcupsd.conf</tt> edit these lines:
<pre>
UPSCABLE usb
UPSTYPE usb
</pre>
Also, '''comment out''' the <tt>DEVICE</tt> line.
</li>

<li>From command line, do
<pre>
chkconfig apcupsd on
</pre>
</li>

<li>Start the daemon manually:
<pre>
apcupsd
</pre>
</li>

<li>Test it:
<pre>
apcaccess
</pre>
This should produce extensive output regarding the UPS unit.<br>
(Note: this command also works for regular users; in that case use <tt>/usr/sbin/apcaccess</tt>.)

</li>

</ul>
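
For scripted checks it can be convenient to pull a single value out of the <tt>apcaccess</tt> output, which consists of <tt>NAME : value</tt> lines. The helper below is a sketch under that assumption; its name is invented here, and the field names used in the usage note (<tt>STATUS</tt>, <tt>BCHARGE</tt>) should be verified against the output of your own unit.

```shell
# Read apcaccess output on stdin and print the value of one field.
# $1: field name exactly as shown in the left-hand column (e.g. STATUS)
ups_field() {
    awk -v k="$1" '{
        i = index($0, ":")
        if (i == 0) next
        key = substr($0, 1, i - 1)
        val = substr($0, i + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", key)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        if (key == k) { print val; exit }
    }'
}
```

Possible usage: <tt>/usr/sbin/apcaccess | ups_field STATUS</tt> or <tt>/usr/sbin/apcaccess | ups_field BCHARGE</tt>.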

Latest revision as of 10:23, 26 September 2019
