Hardware
=== Desktop machines ===

All desktop machines run [http://www.opensuse.org OpenSuSE].

* [[Installation instructions for OpenSuSE 13.1]].
* [[Installation instructions for OpenSuSE Leap 42.1]].
* [[Installation instructions for OpenSuSE Leap 15.0]].
* [[Installation instructions for OpenSuSE Leap 15.1]].

[[Etiquette: Running Nice on other desktop machines]]

=== Clusters ===

==== Minotaur ====

* 38 nodes, each containing two 10-core processors (760 cores total). 16 GB memory per node.<br>Processor type: Intel Xeon E5-2630, 2.2 GHz.
* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Maui. [[Notes on Torque]].

==== Hydra ====

* 60 nodes, each containing two 6-core processors (720 cores total). 12 GB memory per node.<br>8 nodes (queue "fast", nodes h001-h008) have Intel Xeon X5690 3.47 GHz processors.<br>52 nodes (queue "default", nodes h009-h060) have Intel Xeon E5645 2.40 GHz processors.
* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Maui. [[Notes on Torque]]. See the example job script after this list.
* [[General Usage of Hydra]]
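
For reference, here is a minimal sketch of a Torque job script for Hydra. The queue names and cores-per-node figure come from the hardware description above; the job name, walltime, and executable name are placeholders to replace with your own.

<pre>
#!/bin/bash
#PBS -N example_job          # job name (placeholder)
#PBS -q default              # or "fast" for nodes h001-h008
#PBS -l nodes=1:ppn=12       # one Hydra node provides 12 cores
#PBS -l walltime=24:00:00    # adjust to your needs

cd $PBS_O_WORKDIR            # directory the job was submitted from
./my_simulation              # placeholder for your executable
</pre>

Submit with <tt>qsub job.pbs</tt> and check the queue with <tt>qstat -u $USER</tt>; see [[Notes on Torque]] for further options.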

==== Quest ====

* Jobs are scheduled via [http://www.adaptivecomputing.com/products/open-source/torque/ Torque]/Moab. [[Notes on Torque]].
* [[General Usage of Quest]]

=== Disk space, backups, and RAID storage ===

==== Disk space allocations and nightly backups ====

Each user has a home directory located on ''ariadne''. This home directory is exported to all desktop machines, so that you see the same home filesystem on each machine. The drive is protected against hardware failure via a [http://en.wikipedia.org/wiki/RAID_1#RAID_1 RAID-1] setup. Furthermore, each night all new or modified files on /home are written to tape (located in ariadne). This makes it important not to store temporary data in your home folder, as it would quickly fill up the tape. Since users tend to forget this, a quota system has been enabled on ariadne, restricting each user to 15 GB. To check how much space you are using, log on to ariadne and issue the command
<pre>
quota -s
</pre>
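The output lists the space you currently use together with your quota and hard limit. The numbers and device name below are purely illustrative, and the exact column headings depend on the quota version:

<pre>
Disk quotas for user jdoe (uid 1234):
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
      /dev/sdb1   9134M  15360M  16384M            127k       0       0
</pre>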

In addition, each user has significant additional storage on the scratch partitions. These drives are located in the different desktop machines and protected via RAID-1, but backups are your own responsibility. Note that these partitions are generally only mounted on the desktop machine that contains the corresponding drives. If you need a partition to be exported to a different machine, please ask.
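
Since the scratch partitions are not written to tape, it is worth copying important results somewhere safe yourself from time to time. A minimal sketch using rsync; the source and destination paths and the host name are assumptions and should be adapted to your own scratch directory and backup location:

<pre>
# Copy a project directory from scratch to a backup area on another machine.
# --delete makes the copy an exact mirror (removes files no longer in the source).
rsync -av --delete /scratch/$USER/project/ othermachine:/backup/$USER/project/
</pre>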

==== Synology diskstation ====

Pergamon is an 18 TB RAID 6 diskstation. [[Synology diskstation|Detailed usage instructions.]]

==== Changing the nightly backup tape ====

# Press eject button on tape drive in ariadne.
# Take the tape cartridge out of the drive and put it in its box (should be on top of ariadne). Label the box. Give to Erik.
# Insert cleaning tape (on top of ariadne). It will work for less than a minute and then eject automatically.
# Put cleaning tape back in box on top of ariadne.
# Insert new DDS tape (find in cabinet). Leave empty box on top of ariadne.
# Erik: Update settings in /usr/local/lib/backup, namely ''position'' and ''tapenumber''; update logfile.

==== Recovering data from the nightly backup tape ====

Log files of all nightly backup tapes are located on ariadne, in /usr/local/lib/backup. For privacy reasons, these logfiles are only accessible to root. Once the proper file to be recovered has been identified, insert the corresponding tape into the drive on ariadne and follow these steps (all to be executed as root):

# <tt>cd /</tt><br>(if you change to a different directory, the recovered file will be placed relative to this directory)
# <tt>/usr/local/bin/tape-rewind</tt><br>(or <tt>mtst -f /dev/nst0 rewind</tt>)
# <tt>mtst -f /dev/nst0 fsf <position></tt><br>(see the contents file in /usr/local/lib/backup for the position number)
# <tt>tar xzvf /dev/nst0 <full_file_name_without_leading_slash></tt><br>This step won't work unless you omit the leading slash; also note that you can specify multiple files, separated by spaces. The 'z' option is necessary because all nightly backups are compressed. For wildcards, use --wildcards and escape '*' and '?'. For example: <tt>tar -x --wildcards -zvf /dev/nst0 \*datafiles\*</tt> A tip for listing the archive contents first follows this list.
# <tt>/usr/local/bin/tape-rewoffl</tt><br>(or <tt>mtst -f /dev/nst0 rewoffl</tt>)
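
If you are unsure of the exact file name, you can first list the archive contents at the chosen position instead of extracting. Note that this reads through the whole archive and advances the tape, so rewind and repeat steps 2 and 3 before the actual extraction:

<pre>
tar tzvf /dev/nst0
</pre>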

==== Archiving data using the LTO tape drive ====

==== Checking RAID status ====
<ul>
<li><span id="hydra">Hydra</span>
<ul>
<li>OS is on software RAID (which spans /dev/sda and /dev/sdb). An overview is obtained via
<pre>cat /proc/mdstat</pre>
Detailed information via
<pre>
mdadm --detail /dev/mdX
</pre>
where X = 1, 5, 6, 7. Also see [[Setting up e-mail notifications for Linux Software RAID]].</li>
<li>/home: 11 TB RAID 6. Check via
<pre>
opera http://127.0.0.1:81
</pre>
If this doesn't work, restart the Areca HTTP server via the /etc/rc.d/arecaweb script. It is also possible to interrogate the controller from the command line via 'cli64'. Useful commands within cli64 include 'vsf info' (volume set information) and 'disk info'.
</li>
<li>/archive: 80 TB RAID ([http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_60_.28RAID_6.2B0.29 RAID 60]). Check via
<pre>
opera http://127.0.0.1:82
</pre>
Just as for /home, it is possible to use the 'cli64' command-line interface. However, to switch this to the proper controller, first use 'set curctrl=2'.
</li>
</ul>
</li>
<li>Minotaur<br>RAID-6 controller with 12 drives (incl. one hot spare). Web interface. Log in to the head node and use
<pre>
opera http://172.16.0.101
</pre>
The RAID status is visible on the top line of the Raid Set Hierarchy table, under Volume State. For drive stability, click any channel and find the SMART Attributes at the bottom of the page. Each has two values, Attribute and Threshold (Threshold is in parentheses). An Attribute value lower than Threshold indicates an unstable drive.
</li>
<li>Ariadne<br>RAID-5 controller with 4 drives. Status can be checked by interrogating the controller:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | less
</pre>
The 'Device Present' section reports whether any drives are critical or have failed, and what the state of the RAID is. More detailed information can also be found via
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDPDInfo -aAll | less
</pre>
Directly at the beginning (under 'Adapter #0') it should report 'State: Optimal'.
To start a volume consistency check:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -Start -LALL -aAll
</pre>
and to monitor the progress of this check:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -ShowProg -LALL -aAll
</pre>
or, for continuous display:
<pre>
/opt/MegaRAID/MegaCli/MegaCli64 -LDCC -ProgDsply -LALL -aAll
</pre>
</li>
<li>Desktop machines, except pelops<br>Hardware RAID-1. The RAID status is reported upon reboot of a machine. Press Ctrl-C (when prompted) to enter the configuration utility. From within Linux, use (as root):
<pre>
mpt-status -i 0
mpt-status -i 2
</pre>
The second command only applies to machines with a second set of hard drives (achilles, agamemnon, nestor, poseidon).<br>
To allow regular users to verify the RAID status, the <tt>mpt-status</tt> command has been added to <tt>sudo</tt>:
<pre>
sudo mpt-status -i 0
sudo mpt-status -i 2
</pre>
</li>
<li>Pelops: Software RAID (for OS and scratch partitions). See [[#hydra|Hydra]].</li>
</ul>

==== Identifying a failed drive ====

Even if you know that a specific device (e.g., /dev/sdc) has failed, it is not always obvious which physical drive inside the machine this is. For this purpose, it is helpful to identify all other (still functional) drives from the command line. One way to do this is to read all drive serial numbers. For example, find the serial number of /dev/sda:
<pre>
udevadm info --query=all --name=/dev/sda | fgrep SERIAL
</pre>
Having the serial numbers allows you to connect device names to physical drives.
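
To collect the serial numbers of all drives in a machine at once, a small shell loop works; the /dev/sd? pattern is an assumption that covers the usual handful of drives per machine:

<pre>
# Print the device name and serial number for every /dev/sd? drive.
for d in /dev/sd?; do
    echo "== $d =="
    udevadm info --query=all --name=$d | fgrep SERIAL
done
</pre>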

=== Printers ===

There are two black-and-white laser printers (PS1 and PS2) in the lab, both supporting double-sided printing. For network printing, use luijten-ps1.ms.northwestern.edu and luijten-ps2.ms.northwestern.edu, respectively. For OS X, choose "HP JetDirect" as the protocol.

You can also print to luijten-ps1.ms.northwestern.edu from your personal laptop. To add the printer, go to Add Printer > Network Printer; the address/hostname is luijten-ps1.ms.northwestern.edu. On Mac, choose AirPrint > LPD protocol; on Windows, use the TCP/IPP protocol. The printer driver is the Brother HL-L2300 series, and the printer name is luijten-ps1.
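
On a Linux laptop running CUPS, the same printer can also be added from the command line. This is only a sketch using the JetDirect (socket) protocol mentioned above, with 9100 as the standard JetDirect port; depending on your CUPS version you may still need to select the Brother HL-L2300 driver afterwards (e.g. via the CUPS web interface or the <tt>-m</tt> option):

<pre>
sudo lpadmin -p luijten-ps1 -E -v socket://luijten-ps1.ms.northwestern.edu:9100
</pre>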

=== Scanner ===

=== UPS ===

All our UPS units are manufactured by APC and supported via apcupsd. Installation & configuration instructions:
<ul>
<li>Make sure the <tt>apcupsd</tt> package is installed; see [[Installation instructions for OpenSuSE 13.1]].</li>
<li>Connect the UPS unit to a USB port of the corresponding machine.</li>
<li>In <tt>/etc/apcupsd/apcupsd.conf</tt>, edit these lines:
<pre>
UPSCABLE usb
UPSTYPE usb
</pre>
Also, '''comment out''' the <tt>DEVICE</tt> line.
</li>
<li>From the command line, do
<pre>
chkconfig apcupsd on
</pre>
</li>
<li>Start the daemon manually:
<pre>
apcupsd
</pre>
</li>
<li>Test it:
<pre>
apcaccess
</pre>
This should produce extensive output regarding the UPS unit.<br>
(Note: this command also works for regular users; in that case use <tt>/usr/sbin/apcaccess</tt>.) A quick way to check just the key fields is shown after this list.
</li>
</ul>
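
To check just the most important fields (UPS status, battery charge, and estimated runtime) rather than the full listing, the standard apcupsd field names can be filtered out:

<pre>
/usr/sbin/apcaccess status | egrep 'STATUS|BCHARGE|TIMELEFT'
</pre>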