
Infrastructure Progress

News

2009-11-19
2009-11-13
  • Updated to libstdc++6 (4.3.2-1.1) from Lenny on headnodes and computenodes
2009-07-13
  • I stopped courier-mta on Owl to see if Thomas stops getting spam. We need to set up an independent mail server.

2009-07-13
  • Owl froze due to an unknown NFS defect. This happened at 4AM; I restarted Owl at 9:30AM. Some things are becoming clear about the NFS defect. It was probably triggered by this:
    • owl ib1 errors
             WARNINGs: packets is 10452405.01 (outside range [:1]).
2009-07-07
  • Today I created about 5 different containers for various purposes. One command that I always forget is:
    • sudo dpkg-reconfigure locales
      
      # and select en_US.UTF-8 UTF-8
    Also bash completion is tricky to install - it needs to be in your ~/.bashrc not /etc/bash.bashrc
2009-07-02
  • On node 27 there are R processes in an uninterruptible sleep (D) state:
    • grep " D " all-processes 
      node08      :   275 ?        D      0:09  \_ [pdflush]
      node27      :  5200 ?        D      6:55 /opt/R-2.9.0/lib64/R/bin/exec/R
      node27      : 30255 ?        D      0:00 [R]
      node27      : 30320 ?        D      0:00 [R]
      node27      : 19151 ?        D      0:00 [R]
      node27      : 19227 ?        D      0:00 [R]
      node27      : 32241 ?        D      0:00 [R]
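A quick way to list such processes on a single node is to filter ps output on the STAT column. This is a minimal sketch, not the script that produced the all-processes file above; the function names filter_d_state and list_d_state are made up for illustration.

```shell
#!/bin/sh
# Keep only lines whose second field (STAT) starts with "D",
# i.e. processes in uninterruptible sleep.
# Input format (one process per line): PID STAT ELAPSED COMMAND...
filter_d_state() {
    awk '$2 ~ /^D/'
}

# List D-state processes on the local machine.
# Each "-o name=" prints that column with an empty header.
list_d_state() {
    ps -e -o pid= -o stat= -o etime= -o args= | filter_d_state
}
```

Calling list_d_state on a node prints one line per stuck process, or nothing when no process is in the D state.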
2009-06-11
Restarting the NFS server on Biocluster takes a very long time. Something is unhealthy about that.

Running /usr/sbin/rpc.mountd and /usr/sbin/rpc.nfsd could un-freeze the nfs-kernel restart init script.


2009-05-31

On Biocluster, when eth2 (192.168.2....) is down, many problems occur. See Details.

2009-05-30
Fstabs on Biocluster changed to
  • localhost:/projects                /srv/projects   nfs4 bg,rw,soft,intr  0  0
    localhost:/home                    /home           nfs4 bg,rw,soft,intr  0  0
    localhost:/projects/home_girkelab  /home_girkelab  nfs4 bg,rw,soft,intr  0  0
    localhost:/profound/home_bazhlab   /home_bazhlab   nfs4 bg,rw,soft,intr  0  0
    localhost:/profound/home_sladeklab /home_sladeklab nfs4 bg,rw,soft,intr  0  0

2009-05-29
Wrote convert-clustalw-to-fasta. The name fully describes its function. Available everywhere on Biocluster.

2009-05-28
T-Coffee 7.81
  • sudo install -path=/opt/t-coffee

However, nothing was written to /opt/t-coffee; only 2 man files were written to /usr/local (/usr/local/man/man1/mafft{,-homologs}.1)
For details see: T-Coffee-install-help

2009-05-27
On space1, installed lib32gcc1, which was needed for hpasmcli
2009-05-18
  • wget http://www.micans.org/mcl/src/mcl-09-116.tar.gz
    tar zxvf mcl-09-116.tar.gz
    cd mcl-09-116/
    ./configure --prefix=/opt/mcl-09-116 --enable-blast
    make
    make install

--enable-blast is very important



2009-05-12
  •      
    sudo aptitude install lm-sensors
    sudo sensors-detect # press enter ~10 times
    sudo modprobe <suggested-module>
    sudo modprobe <suggested-module>
    sudo modprobe <suggested-module>
    sudo vi /etc/modules # add the suggested modules
    sudo sensors # you should see the temperatures, and possibly voltages and fan speeds
    
    # sudo aptitude install munin munin-node
    cd /etc/munin/plugins
    sudo ln -s /usr/share/munin/plugins/sensors_ sensors_temp
    sudo ln -s /usr/share/munin/plugins/sensors_ sensors_fan  # if you have fan readings
    sudo ln -s /usr/share/munin/plugins/sensors_ sensors_volt # if you have voltage readings
    sudo /etc/init.d/munin-node restart
    
    • Kevin, thank you for this tip
2009-05-06
  • To free-up some space on bioweb, removed a 2005 version of PFAM (/srv/exports/PFAM-20050422112729/)
2009-05-06
  • space2:/srv/data/NCBI/ (189G) was larger than bioweb:/srv/exports/NCBI/ (158G) because the old sysadmin duplicated the data where symlinks occurred (probably used scp instead of rsync)
2009-05-06
  • Not being used:
    • space2:/srv/data/cellwall (mounted at bioweb:/srv/web/Cellwall_space)
    • space2:/srv/data/blast (mounted at bioweb:/srv/web/blast)
    • space2:/srv/data/privateBlast (mounted at bioweb:/srv/web/privateBlast)


2009-04-27
To add the Postgres server instrumentation functions, one needs to:
  1. Install postgresql-contrib Debian package

  2. Run psql -U ADMIN_USERNAME postgres < /usr/share/postgresql/8.3/contrib/adminpack.sql


2009-04-16

Did the dd if=/dev/zero and the dd of=/dev/null filesystem performance tests on the cluster nodes:

  • 170 (+/-5) Megabytes/second disk read speed

  • 75 (+/-10) Megabytes/second write speed
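The two tests can be reproduced roughly as follows. This is a sketch under assumptions: the function names, the 1 MiB block size, and the test file location are illustrative choices, not the exact invocation used for the numbers above. conv=fsync (GNU dd) makes the write timing include flushing to disk; the read test may be served from the page cache unless the file is larger than RAM.

```shell
#!/bin/sh
# Sequential write test: write COUNT blocks of 1 MiB of zeros to FILE.
# conv=fsync (GNU dd) flushes the data to disk before dd reports its rate.
dd_write_test() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" conv=fsync
}

# Sequential read test: read FILE back and discard the data.
dd_read_test() {
    dd if="$1" of=/dev/null bs=1048576
}
```

For example, dd_write_test /scratch/ddtest 4096 followed by dd_read_test /scratch/ddtest exercises a 4 GiB file; dd prints the throughput on stderr. Remove the test file afterwards.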


2009-04-15

EMBOSS Explorer installed in an Owl container http://emboss.bioweb.ucr.edu


2009-04-15

Synced packages on Owl and Biocluster.

  • Try running compare-to-owl on Biocluster


2009-04-14

Wrote The Check: /usr/local/bin/check (it looks at the CPU time since the process was started, so it is the process's lifetime average)

  • Try running: check


2009-04-13
The freeze of Munin was due to owl being down. After commenting out /etc/munin/munin.conf and restarting /etc/init.d/munin-node, everything started working again.
  • The first error in /var/log/munin/munin-graph.log was

    Apr 11 21:29:44 - Unable to graph /var/lib/munin/headnodes/owl.headnodes-if_eth0-down-c.rrd: invalid rpn expression in: idown,adown,-

    The best way to diagnose Munin is still:

    • sudo tail -f /var/log/munin/*.log | grep -iE '(unable)|(err)|(warn)'


2009-03-28
  • Re-imaged nodes 7, 9, 18, 19, 23, 24, 25, 26 and ran ldconfig on all the other nodes. This concludes the Amber "ptraj" fix for Qaiser
2009-03-28
  • Upgraded OpenMPI to v1.3.1 (2009-03-19) in biocluster:/usr/local


2009-02-17
  • Launched an OpenVZ container with Apache for serving the user homes because .htaccess functionality was needed. Look for "apache" in /etc/lighttpd/front.conf and ssh apache-homes


2009-01-27
  • The NAT on owl should be
    • sudo iptables -L -t nat
      Chain PREROUTING (policy ACCEPT)
      target     prot opt source               destination
      DNAT       tcp  --  anywhere             swiki               tcp dpt:www to:192.168.3.221:8080
      DNAT       tcp  --  anywhere             wiki                tcp dpt:www to:192.168.3.222:8080
      DNAT       tcp  --  anywhere             iwiki               tcp dpt:www to:192.168.3.223:8080
      DNAT       tcp  --  anywhere             manual              tcp dpt:www to:192.168.3.224:8080
    After the reboot the last entry did not survive. TODO: Find what restores the firewall on owl. To add these entries use this as an example
    • sudo iptables -t nat -A PREROUTING -d 192.168.3.224/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 192.168.3.224:8080
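To re-add all four entries after a reboot, the example can be wrapped in a small loop. This is a sketch, assuming each container should forward port 80 to port 8080 on its own address exactly as in the example; dnat_rule is a made-up helper that only prints the command so it can be reviewed before being run with sudo.

```shell
#!/bin/sh
# Print the iptables command that forwards port 80 to port 8080
# for one container address. Nothing is executed here.
dnat_rule() {
    echo "iptables -t nat -A PREROUTING -d $1/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination $1:8080"
}

# Addresses taken from the NAT listing above.
for ip in 192.168.3.221 192.168.3.222 192.168.3.223 192.168.3.224; do
    dnat_rule "$ip"
done
```

Printing rather than executing keeps the script harmless on a host where some of the rules already exist, since -A would append duplicates.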
2009-01-27
  • The load growth on Owl started around Noon of the 26th, when we were doing load experiments on the NFS server:

    owl.headnodes-load-week.png

  • The host went down with
    Jan 26 12:35:05 owl kernel: nfs: server biocluster not responding, timed out
    Jan 26 12:36:05 owl kernel: nfs: server biocluster not responding, timed out
    Jan 26 12:37:05 owl kernel: nfs: server biocluster not responding, still trying
    Jan 26 12:38:05 owl kernel: INFO: task sh:27339 blocked for more than 120 seconds.
    Jan 26 12:38:05 owl kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jan 26 12:38:05 owl kernel: sh            D ffff8109d380f090     0 27339  27337
    Jan 26 12:38:05 owl kernel:  ffff810ba39298f8 0000000000000082 ffff810ba39298c0 0000000000000000
    Jan 26 12:38:05 owl kernel:  0000000000000000 ffff810ba39298a8 0000000000000000 ffffffff805b9200
    Jan 26 12:38:05 owl kernel:  ffffffff805b9200 ffffffff805b9200 ffffffff805b9200 ffffffff805b9200
    Jan 26 12:38:05 owl kernel: Call Trace:
    Jan 26 12:38:05 owl kernel:  [<ffffffffa034db76>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
    Jan 26 12:38:05 owl kernel:  [<ffffffffa034dba0>] :sunrpc:rpc_wait_bit_killable+0x2a/0x31
    Jan 26 12:38:05 owl kernel:  [<ffffffff80415158>] __wait_on_bit+0x40/0x6e
    Jan 26 12:38:05 owl kernel:  [<ffffffffa034db76>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
    Jan 26 12:38:05 owl kernel:  [<ffffffff804151f2>] out_of_line_wait_on_bit+0x6c/0x78
    Jan 26 12:38:05 owl kernel:  [<ffffffff80243fe4>] wake_bit_function+0x0/0x23
    Jan 26 12:38:05 owl kernel:  [<ffffffffa034a62d>] :sunrpc:xprt_connect+0x89/0x124
    Jan 26 12:38:05 owl kernel:  [<ffffffffa034e1a2>] :sunrpc:__rpc_execute+0x139/0x2af
    Jan 26 12:38:05 owl kernel:  [<ffffffffa03475a7>] :sunrpc:rpc_run_task+0x5a/0x62
    Jan 26 12:38:05 owl kernel:  [<ffffffffa0347644>] :sunrpc:rpc_call_sync+0x3e/0x5b
    Jan 26 12:38:05 owl kernel:  [<ffffffffa03a7c69>] :nfs:nfs4_proc_access+0x147/0x1ca
    Jan 26 12:38:05 owl kernel:  [<ffffffff8027197b>] __rmqueue_smallest+0x88/0x10a
    Jan 26 12:38:05 owl kernel:  [<ffffffff80271a18>] __rmqueue+0x1b/0x1c1
    Jan 26 12:38:05 owl kernel:  [<ffffffff80271c04>] rmqueue_bulk+0x46/0x8f
    Jan 26 12:38:05 owl kernel:  [<ffffffff802798b5>] zone_statistics+0x3c/0x90
    Jan 26 12:38:05 owl kernel:  [<ffffffff80273137>] get_page_from_freelist+0x48f/0x60d
    Jan 26 12:38:05 owl kernel:  [<ffffffffa039224a>] :nfs:nfs_do_access+0x161/0x30a
    Jan 26 12:38:05 owl kernel:  [<ffffffffa03924df>] :nfs:nfs_permission+0xec/0x15b
    Jan 26 12:38:05 owl kernel:  [<ffffffff8029eb78>] permission+0xa9/0xf4
    Jan 26 12:38:05 owl kernel:  [<ffffffff8029ffdb>] __link_path_walk+0x143/0xdda
    Jan 26 12:38:05 owl kernel:  [<ffffffff802a0cb8>] path_walk+0x46/0x8b
    Jan 26 12:38:05 owl kernel:  [<ffffffff802a0fe2>] do_path_lookup+0x154/0x1ce
    Jan 26 12:38:05 owl kernel:  [<ffffffff802a1af4>] __path_lookup_intent_open+0x56/0x97
    Jan 26 12:38:05 owl kernel:  [<ffffffff8029b23a>] open_exec+0x24/0xb2
    Jan 26 12:38:05 owl kernel:  [<ffffffff8027e063>] handle_mm_fault+0x3db/0x829
    Jan 26 12:38:05 owl kernel:  [<ffffffff80280ec2>] vma_merge+0x141/0x1ee
    Jan 26 12:38:05 owl kernel:  [<ffffffff8029c2f8>] do_execve+0x74/0x20f
    Jan 26 12:38:05 owl kernel:  [<ffffffff8020a47f>] sys_execve+0x35/0x4c
    Jan 26 12:38:05 owl kernel:  [<ffffffff8020c2ca>] stub_execve+0x6a/0xc0
    Fixed by rebooting.


2009-01-26
  • Set the IO scheduling of /home back to completely fair queuing.
    sudo su -c "echo cfq > /sys/block/sdc/queue/scheduler"
    A good way to see all the IO scheduling policies is:
    tail /sys/block/sd*/queue/scheduler
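The scheduler files list every available policy and mark the active one with square brackets. A small sketch that prints only the active policy per disk; the function name active_scheduler is made up for illustration.

```shell
#!/bin/sh
# Extract the active scheduler (the name in square brackets)
# from a line such as "noop anticipatory deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Print "device: scheduler" for every sd* disk that is present.
for f in /sys/block/sd*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    echo "${dev%%/*}: $(active_scheduler < "$f")"
done
```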


2009-01-20
  • Re-imaging node07 and node10.


2009-01-16

Disabling Kernel Randomization of stack pointer on node32 (it will be re-imaged to the rest of the nodes). This is in response to the warning in the NAMD output. See Worker-Node-Kernel-Configuration for details.


2009-01-15

Removed /usr/local/sbin/munin-generate-aggrigations and all the related junk from /etc/munin/munin.conf because this and this are the new generation of graphs.


2009-01-15

node21 got InfiniBand connection down. Same as the other cases. Caused by NAMD crash. Interestingly the namd processes were still running, but were impossible to terminate.


2009-01-13

node12 got InfiniBand connection down. Same as the other cases. Caused by NAMD crash.


2009-01-10

node07 still appeared down in qnodes -l. Fixed by restarting pbs_mom

2009-01-08 (night)

node07 got InfiniBand connection down. Same.

2009-01-07

node16 got InfiniBand connection down. Fixed by rebooting. Syslog had the usual messages:

  • Jan  7 23:07:54 node16 kernel: NETDEV WATCHDOG: ib1: transmit timed out
    Jan  7 23:07:54 node16 kernel: ib1: transmit timeout: latency 4126424 msecs
    Jan  7 23:07:54 node16 kernel: ib1: queue stopped 1, tx_head 877748, tx_tail 877684
    and
    Jan  7 23:26:00 node16 kernel: mlx4_core 0000:01:00.0: SW2HW_MPT failed (-16)
    Jan  7 23:26:20 node16 kernel: mlx4_core 0000:01:00.0: HW2SW_MPT failed (-16)
    Jan  7 23:26:40 node16 kernel: mlx4_core 0000:01:00.0: SW2HW_MPT failed (-16)


2009-01-07

Implemented IO-by-User Monitoring on node01

2009-01-07

Documented the Users-Master-List

2009-01-07

In biocluster:/usr/local created commit 9f561e7: Implemented a nice timeout tool: run-with-timeout


2009-01-06

Installed iotop on all nodes :) :) :)

2009-01-06

node11 got InfiniBand connection down. Fixed by restarting.


2009-01-05

Recorded all Verari Service Numbers


2009-01-05
Node11 lost IB temporarily.
  • tail -f /var/log/syslog
    Jan  5 17:52:05 node11 kernel: NETDEV WATCHDOG: ib1: transmit timed out
    Jan  5 17:52:05 node11 kernel: ib1: transmit timeout: latency 3996704 msecs
    Jan  5 17:52:05 node11 kernel: ib1: queue stopped 1, tx_head 302521, tx_tail 302457


2009-01-03
Node01 is still invisible to the VSM.
2009-01-03
Restored VCC Database to the 2008-12-01 snapshot because a handful of nodes in shelf 1 became invisible to the VSM.


2008-12-20

node28 has IB packet loss

2008-12-30

node07 and node20 have their InfiniBand connections down. They were both running Quaiser's tasks. Fixed by restarting.

  • In syslog both nodes are printing the following once per second with slightly different values:
    kernel: NETDEV WATCHDOG: ib1: transmit timed out
    kernel: ib1: transmit timeout: latency 420676572 msecs
    kernel: ib1: queue stopped 1, tx_head 242070168, tx_tail 242070104
2008-12-30

Backup ran out of space, because a large number of files were copied from /home to /profound and then deleted from /home for mbazhenov.


2008-12-29

I accidentally gave Owl the IP of Biocluster (placed its MAC address as the identifier of Biocluster in /etc/biocluster :( ). This started kicking users out of Biocluster and placing them into Owl when they tried to log in.

2008-12-23
From LSI SANtricity software
  • Drive at Tray 0, Slot 5
    Status: Failed
    Mode: Assigned
    Raw capacity: 279.397 GB
    Usable capacity: 278.897 GB
    World-wide identifier: 20:00:00:1d:38:58:48:ca:00:00:00:00:00:00:00:00
    Associated volume group: 1
    
    Port 0; Channel 1; ID 35/0xAD
    Port 1; Channel 2; ID 18/0xCB
    
    Drive path redundancy: OK
    Drive type: Fibre Channel
    
    Speed: 15015 RPM
    Current data rate: 4 Gbps
    Product ID: ST3300655FC
    Firmware version: MS08
    Serial number: 3LM3MWHP00009828SWN9
    Vendor: SEAGATE
    Date of manufacture: February 21, 2008
2008-12-23
  • Owl had a problem when shutting down. IB's stop rc script runs before NFS stop rc script, so NFS freezes and halts the entire shutdown sequence.


2008-12-22
Node09 does not respond to ping on the IB network. Eth network is fine.
  • (!) Looks like a node that dies after a failed NAMD submission. Fixed by rebooting.

    (!) Also had to restart pbs_mom after booting.


2008-12-21

The SilverStorm switch web-ui shows some unhealthy logs

2008-12-21
Node28 does not respond to ping on the IB network. Eth network is fine.
  • (!) Looks like a node that dies after a failed NAMD submission. Fixed by rebooting and placed back into queuing system.


2008-12-19
Node26 and Node19 have serious ping problems.
2008-12-19
Node26 lost NFS.
2008-12-19

Nodes 14, 15, 16, 17 (possibly), and 18 (less possibly) had qstat status E for the tasks that were on them. This is because I temporarily disconnected them from the IB network (did not suspect that Torque would be that sensitive).


2008-12-17
  • Node27 was not responding (even to keyboard and monitor). It showed I/O errors on sda. After a restart the BIOS could not locate the hard drive. I opened the node and swapped the SATA cable that was in the drive with the other free one that was in the node. It is now working, but I had to restart pbs_mom to get the qnodes state to change from DOWN to FREE.

2008-12-16
  • Restarted pbs_mom on node21 to get it to change state (per qnodes) from DOWN to FREE.


2008-12-15
  • In /etc/munin/plugin-conf.d/munin-node the list of users for head node CPU accounting is set to

    • afatmi
    • alevchuk
    • ebolotin
    • lgao
    • qwu
    • root
    • tgirke
    • xpcui
2008-12-15
  • /etc/munin/plugins/cpu and /etc/munin/plugins/cpubyuser were updated to adjust how the graph is displayed.

2008-12-15
  • Started using Munin to perform the quota monitoring. Thomas and I are on the notification list. The standard quota package is only used as an accounting mechanism for quota; it does not have any soft or hard limits. The equivalent of "soft" limits is set in /etc/munin/munin.conf, but it only sends out notifications and does not have any grace period.

2008-12-15
  • On the Dell workstations (core3, core4, batch2145d, keenhall1008), blocked the monitoring system's warnings for the S.M.A.R.T. values, because they seem to be irrelevant and the following command reports PASSED
    sudo smartctl -H /dev/hda3
2008-12-15
  • On Biocluster, replaced vim-tiny with the full version:
    [INSTALL] vim-full
    [INSTALL] vim-latexsuite
    [INSTALL] vim-lesstif
    [INSTALL] vim-perl
    [INSTALL] vim-python
    [INSTALL] vim-ruby
    [INSTALL] vim-scripts
    [INSTALL] vim-tcl
    [REMOVE] vim-tiny
2008-12-15
  • On Biocluster, did Python 2.4 and Perl 5.8 minor Debian updates with aptitude:

    [UPGRADE] libperl5.8 5.8.8-7etch3 -> 5.8.8-7etch5
    [UPGRADE] perl 5.8.8-7etch3 -> 5.8.8-7etch5
    [UPGRADE] perl-base 5.8.8-7etch3 -> 5.8.8-7etch5
    [UPGRADE] perl-doc 5.8.8-7etch3 -> 5.8.8-7etch5
    [UPGRADE] perl-modules 5.8.8-7etch3 -> 5.8.8-7etch5
    [UPGRADE] python2.4 2.4.4-3+etch1 -> 2.4.4-3+etch2
    [UPGRADE] python2.4-dev 2.4.4-3+etch1 -> 2.4.4-3+etch2
    [UPGRADE] python2.4-minimal 2.4.4-3+etch1 -> 2.4.4-3+etch2
2008-12-15
  • The Wiki container backed-up.


2008-12-09
  • The middle shelf on the Biocluster's SAN shows a /!\ (exclamation mark in a triangle)

2008-12-09
  • Looks like node21 was knocked out by NAMD. Rebooting via Verari web control panel.

2008-12-08
  • Installed  /home/khoran/downloads/pbzip2_1.0.3-1_amd64.deb on Biocluster, node02, and node32. This is a parallel bzip compressor.

2008-12-06
  • Undocumented command from the past (access to the disk IO scheduler policy):
    • sudo cat /sys/block/sd{a,b,c,d}/queue/scheduler


2008-11-27

TODO List 1b

2008-11-27

Wrote the namd-kill-all-jobs script. Its ancestor is namd-kill-job. Both scripts are located in /usr/local/bin/


2008-11-26

First short-term TODO list: TODO List 1a


2008-11-20

In the last 7 days there were some changes, problems, and solutions related to NFS.


2008-11-18

For a still unknown reason, OpenVZ checkpointing stopped working correctly after the recent reboot of Owl

2008-11-18
The number of NFS threads was increased from 80 to 160, because all 80 were staying busy.
  • After the NFS restart, the number of busy threads went up to 134. The current "load" of the system is 143.
  • All compute nodes are still very slow on IO system calls, but fortunately no "Stale NFS handles" are observed.
  • After about one hour, the number of busy NFS threads dropped to 0.
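On Debian the thread count is normally set through RPCNFSDCOUNT in /etc/default/nfs-kernel-server, and the running count appears on the "th" line of /proc/net/rpc/nfsd. The sketch below reads that value; it assumes this file layout and uses a made-up function name, and it does not show how the busy-thread figures above were collected.

```shell
#!/bin/sh
# Print the number of nfsd threads from an nfsd statistics file.
# The "th" line starts with: th <thread-count> ...
# With no argument, the live kernel file is read.
nfsd_thread_count() {
    awk '$1 == "th" { print $2 }' "${1:-/proc/net/rpc/nfsd}"
}
```

Running nfsd_thread_count on the NFS server after a restart is a quick check that the new setting took effect.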
2008-11-18

Nodes 4-to-23 and 27-to-32 were re-imaged. All compute-nodes except node12, node13, and node19 are available for task submission (using qsub).


2008-11-12

Installed all software on http://molpopgen.org/software/lseqsoftware.html which are using libsequence which was also installed recently


2008-11-01

On the backups host, /var usage spikes stopped after I added PRUNEPATHS="/srv /var/lib/vz" to /etc/updatedb.conf. Maybe this should be re-enabled later so that users can search their backups.


2008-10-23
Approx Debian mirror changed
2008-10-21
Biocluster had symptoms similar to the Oct 17th crash, but I prevented the crash by terminating the process that was using 98% of the memory. The process was an R script that one of our researchers was running.
2008-10-17

Biocluster 2008-10 Crash soon after I re-imaged all nodes.


2008-09

Found the OpenSM problem


2008-09-03
Uninstalled php4-cgi and php4-common from Biocluster


2008-09-03

Created the namd-kill-job and namd-start. See Using NAMD.

2008-09-02

Created the ping-ibs script. See Pinging All Infiniband Cards.

2008-08-27

Work related bookmarks of Aleksandr Levchuk can be found here: http://ihooh.com/tags/?id=75146483

2008-08-24
After rebooting all the nodes, node27 did not boot. After rebooting 2 times it went back online.
2008-08-23
The Verari Service Module will power down nodes even if you say POWER UP on the nodes that are already booted. I learned it the hard way.
2008-08-22
One broken RAM stick can fool you into replacing the motherboard. In some slots a broken RAM stick can stall the Verari nodes so that they never reach the BIOS. A broken RAM stick in other slots does not cause this failure. I could not figure out exactly which slots cause this, but the every-4th blue slot does not do the trick. Placing the broken stick into 2 nodes (3 different motherboards) made it seem that the killer slot is somewhere around the second blue slot (counting from the front panel on the node).


2008-08-19

Discovered that in order for ulimit to take effect for all users, it must be set by root before su-ing to a user. In other words, set it in the script that launches pbs_mom.
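The reason is that resource limits are inherited by child processes, so a limit set in the launching shell applies to pbs_mom and to every job it starts. A small sketch of that inheritance, using the open-files limit purely as an example; run_with_nofile_limit is a made-up name, and which limit actually needs changing on the cluster is not stated here.

```shell
#!/bin/sh
# Run a command with the open-files limit set to N.
# The subshell keeps the change from affecting the caller;
# the command and all of its children inherit the new limit.
run_with_nofile_limit() {
    limit=$1
    shift
    ( ulimit -n "$limit" && "$@" )
}
```

For example, run_with_nofile_limit 64 sh -c 'ulimit -n' prints 64, showing that the child shell sees the limit set by its parent.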


2008-08-15
Changed /home from XFS to EXT3.


2008-08-06

Documented the Creating a New Container procedure


2008-08-02

Documented the Making a Template from a Container procedure


2008-08-01

Documented the Version Controlling the Biobuntu Template procedure

2008-08-01

Wrote the wget-to-destination script. It is good for putting files into containers that do not have their own Internet.


2008-07-30
I found an interesting behavior of Bash:
  • # 1.
    `echo; sleep 10` #(include the back-ticks)
    
    # 2. Press C-z. 

This will lock you out of your shell. The normal C-c will not work. The only way to get out of it is to run a kill from another shell.

(./) Tested on Debian 4 and RedHat 5.

2008-07-30
Perl's package manager CPAN was upgraded to script version 1.9, CPAN.pm version 1.9205


2008-07-28

Some nodes still have not been re-imaged. Check with on-all-nodes-run sudo aptitude show libc6 \| grep Version


2008-07-22

On Biocluster the files in .html/ in homes are now served as static web content. For example http://biocluster.ucr.edu/~tgirke/


2008-07-18

{X} Node26 is down due to packet loss in the InfiniBand link.

2008-07-22

(./) Fixed


2008-07-17

{X} There is a hard-to-reproduce problem with /etc/init.d/biocluster-infiniband on the compute nodes. The script adds duplicate ib1 entries to /etc/network/interfaces. Although I re-imaged and re-booted about 20 nodes at different times, this problem happened only twice in that session.


2008-07-16

Looks like the NTP server needs to be stopped before you can run sudo ntpdate-debian; otherwise you will get the "NTP socket is in use, exiting" error. So, Biocluster is our only NTP server. This will be OK for the projected implementation of the Owl fail-over in case of hardware failure in Biocluster, because Owl will become an absolute replacement and will launch the NTP server.


2008-07-12

Using Aptitude on All Nodes at Once works.


2008-07-08

Figured out why my script on-all-nodes-run was waiting for some nodes indefinitely. Re-imaging all free nodes to fix the problem. Manually fixing the busy nodes 1, 2, 4, 5, and 14.


2008-07-06

While Kerberos is working correctly on all nodes, Node07 started asking for password when SSHing from Biocluster. on-all-nodes-run was working fine on all nodes (except 09 and 30) on Thursday.

  • :) Figured out: auth.log: Unknown code krb5 37


2008-07-03
Network switched to the 10 Gigabit Ethernet in the Data Center


2008-07-01

Documented the Resetting Torque system administration procedure.

2008-07-01

Documented the Reserving Nodes system administration procedure.

2008-07-01

All nodes (except node27 and node28) re-imaged and ready for centralized software installation. To install software please follow the procedure documented in Software in Local.


2008-06-30

qsub and all the other torque operations became available on the worker nodes. It just happened as a side effect after centralizing /usr/local

2008-06-30
Fixed Ethernet connections for node28 and node31. On both nodes the eth0 ports are broken. Had to reconnect the cables to eth1, update the hardware MAC addresses in /etc/dhcp3/dhcpd.conf, and re-image the nodes.


2008-06-27

Fixed Torque by installing it from source rather than as a package. This fixes xpbmon. Made the jobs of other users visible with qstat. For details see Queuing with Torque


2008-06-25

Installed the blast2 package on all the nodes. Fixed the permissions on the /scratch partition.


2008-06-24

This afternoon we went into production. The following Announcement was sent out to all our users.


2008-06-20 8AM

Reverted bioinfo back to what it was. There is an issue with mounting Biocluster's fs on bioinfo's nodes.

2008-06-20 8AM

Re-configured bioinfo so that it mounts the home of Biocluster

2008-06-20 8AM

On bioinfo, moved /srv/exports/home to /srv/exports/home-bioinfo

2008-06-20 7AM

Brought down NFS server on bioinfo

2008-06-20 2AM

Biocluster re-imaged. Came up with the following issues:

  • {X}

    SAN not mounted

    What is really needed is a good fstab for Biocluster. See below. Fixed Temporarily!

    • # /etc/fstab: static file system information.
      #
      # <file system>      <mount point>  <type>      <options>    <dump> <pass>
      proc                 /proc                      proc        defaults     0  0
      /dev/sda2            /                          xfs         defaults     0  1
      /dev/sda1            /boot                      ext3        defaults     0  2
      /dev/sda7            /tmp                       xfs         defaults     0  2
      /dev/sda5            /usr                       xfs         defaults     0  2
      /dev/sda6            /var                       xfs         defaults     0  2
      /dev/sda8            /scratch                   xfs         defaults     0  2
      
      LABEL="san/projects" /srv/projects              xfs         defaults     0  2
      LABEL="san/homes"    /home                      xfs         defaults     0  2
      LABEL="san/profound" /profound                  xfs         defaults     0  2
      
      /srv/projects        /srv/nfs4exports/projects  none        rw,bind      0  0
      /home                /srv/nfs4exports/home      none        rw,bind      0  0
      /profound            /srv/nfs4exports/profound  none        rw,bind      0  0
      
      /dev/sda9            none                       swap        sw           0  0
      /dev/hdb             /media/cdrom0              udf,iso9660 user,noauto  0  0
  • {X}

    KDC down

    the init scripts krb5-kdc and  krb5-admin-server were looking for the old /usr/local binaries. Fixed Permanently!

    {X}

    biocluster172 hostname

    This is a result of SystemImager. That's why completing /etc/init.d/biocluster-network-config is important. The final hostname must be biocluster.ucr.edu. Fixed Temporarily!

    {X}

    Never appeared in /var/lib/systemimager/clients.xml

    I'm not sure what happened there. I thought I configured the monitoring correctly.


  • 2008-06-19 6PM

    Owl re-imaged. The /home, /profound, and /srv/projects NFS mounts were not present in /etc/fstab, of course.

    2008-06-19 12PM

    Fixed the "Name to ID mapping"/"Large Groups" problem by upgrading to libnfsidmap-0.20. The upgrade was invisible to the package manager (by simply overwriting /usr/lib/libnfsidmap.so.0.2.0 on the worker nodes and the head nodes (the older version was 40008 bytes; the new one is 44240 bytes)). We did this because we are anticipating the same fix to occur in the upgrade to Lenny. See: /root/src/alevchuk/second-quarter/ and Users and Groups


    2008-06-18 5PM

    I was going to, but did not, obviate the "Name to ID mapping problem" in NFSv4. See Users and Groups for details. Also see my mailing list follow-up http://linux-nfs.org/pipermail/nfsv4/2008-June/008814.html

    2008-06-18 9AM

    Question about the NFSv4 "nogroup" problem posted: http://linux-nfs.org/pipermail/nfsv4/2008-June/008803.html


    2008-06-17 3PM

    Updating /etc/passwd and /etc/group on Biocluster and Owl.

    2008-06-17 12PM

    After changing the hostname of Biocluster from biocluster.ucr.edu to biocluster Kerberos SSH connection stopped working. The /var/log/auth.log says debug1: Got no client credentials and the connection gets closed immediately. This could be a hard to find problem.


    2008-06-16 9PM

    {X} node09 fails to bring up Infiniband interface. sudo cat /var/log/dmesg | grep Mel gives empty output.

    2008-06-16 9PM

    {X} qsub -I does not work because pbs_mom on the worker nodes cannot open a connection back to pbs_server on the head node. This happens because the client tries to connect to biocluster.ucr.edu.

    • (./)

      I verified that adding biocluster.ucr.edu at the end of the 192.168.3.21 line in /etc/hosts on one of the worker nodes fixes this issue.

      (./)

      I tried to fix it without having the ucr.edu entry on the internal network by changing /var/spool/torque/server_name from localhost to biocluster. This would verify the hypothesis that pbs_mom on the worker nodes is trying biocluster.ucr.edu because /var/spool/torque/server_name is localhost, which is resolved to the loopback device, which then somehow gets routed to 138.23.201.83, which resolves to biocluster.ucr.edu. The change did not give the anticipated effect. The hypothesis was not verified.

    2008-06-16 9PM

    {X} On the worker nodes syslogs are flooded with error messages. OpenSM daemons are not initialized on the correct ports. This happens because the daemon is launched before the Infiniband network is up and it binds to a wrong port.

    • (./)

      Fixed by making a separate init script

      Needs to be documented in Init Scripts


    2008-06-13 4PM

    Ran make uninstall for Maui, this removed some ".a" and ".h" files, but not everything. I also moved /var/spool/maui/ and /usr/local/maui/ to trash.

    2008-06-13 4PM

    Torque started working after launching pbs_sched. But not immediately after :/.


    2008-06-12 9PM

    Downloaded a scheduler for Torque, called Maui. Had to register my email |) Direct link: http://www.clusterresources.com/downloads/maui/maui-3.2.6p19.tar.gz

    2008-06-12 8PM

    Installed a lot of packages. Mostly compilers.

    2008-06-12 6PM

    Updated /etc/profile with some pieces of code from /etc/skel/.bashrc; modified /etc/bash.bashrc for limitless ~/.bash_history growth without letting the users easily turn that off.

    2008-06-12 10AM

    Found a bug. :) The admins group was missing. Re-imaging all the worker nodes... done! 12:30 PM

    2008-06-12 3AM

    Re-imaged all worker nodes with the updated /etc/passwd and /etc/group

    2008-06-12 12PM

    Updated /etc/passwd and /etc/group with the data from bioinfo LDAP. See Users and Groups for details.


    2008-06-11 2PM

    Setting up NFSv4 according to wiki.linux-nfs.org and www.crazysquirrel.com

    2008-06-11 2PM
    alevchuk@node02:~$ time ssh node01 ssh node02 ssh node03 ssh node04 ssh node05 ssh node06 ssh node07 ssh node08 ssh node10 ssh node11 ssh node12 ssh node13 ssh node14 ssh node15 ssh node16 ssh node17 ssh node18 ssh node19 ssh node20 ssh node21 ssh node22 ssh node23 ssh node24 ssh node25 ssh node26 ssh node27 ssh node28 ssh node29 ssh node31 ssh node32 hostname
    
    node32
    
    real    0m22.146s
    user    0m0.012s
    sys     0m0.000s
    • This creates a chain of kerberized SSH connections from node n to n + 1 covering all of our functional nodes. The hostname of node32 is passed back to the origin through the entire chain.

    • Isn't that lovely?
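The long command above can also be generated from a node list instead of being typed by hand. A minimal sketch; ssh_chain_cmd is a made-up helper name, and it only builds the command string.

```shell
#!/bin/sh
# Build a nested command "ssh n1 ssh n2 ... ssh nK hostname"
# from a list of node names given as arguments.
ssh_chain_cmd() {
    cmd=""
    for node in "$@"; do
        cmd="${cmd}ssh $node "
    done
    echo "${cmd}hostname"
}
```

To run the chain, pass the result to eval, for example eval "$(ssh_chain_cmd node01 node02 node03)". Skipping a broken node is then just a matter of leaving it out of the argument list.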
    2008-06-11 12PM

    node09 fails to bring up Infiniband interface. Possibly a hardware problem.

    2008-06-11 12PM

    Created an official known_hosts file for the worker nodes. Added to /etc/skel

    2008-06-10 7PM

    Updated the biocluster-infiniband init script so that it parses /etc/network/interfaces instead of parsing the output of ifconfig. Then it updates /etc/network/interfaces and exits, just like biocluster-networking-config on Biocluster.
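
A minimal sketch of the parsing idea, assuming the standard Debian interfaces(5) layout. The real init script is not reproduced here: list_ifaces is a hypothetical helper and the interface name ib0 is an assumption.

```shell
#!/bin/bash
# Sketch only: list_ifaces is a hypothetical helper, not the actual init script.
# It prints the interface name from every "iface" stanza in an interfaces(5) file.
list_ifaces() {
    awk '$1 == "iface" { print $2 }' "$1"
}

# Example: check whether an Infiniband interface (assumed to be named ib0) is configured.
if list_ifaces /etc/network/interfaces 2>/dev/null | grep -q -x 'ib0'; then
    echo "ib0 is already configured"
else
    echo "ib0 is not configured yet"
fi
```

Reading the configuration file shows what is supposed to be configured, while ifconfig only reports interfaces that are currently up.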


    2008-06-09 11PM

    Created the Saving Disk Space page.


    2008-06-06 11PM

    :) :) :) I got the GSSAPI ssh to handle both the Workstation -> Biocluster and Biocluster -> node31 kerberized connections. :_) :) :) :)


    2008-06-05 4PM

    Took out scripts to the SAN, so now there are both si_images and si_scripts. Rsync'd to backup.

    2008-06-05 4PM

    Re-imaged Owl while connected to the SAN.

    2008-06-05 2PM
    • Fixed the SAN while re-imaging problem by disabling the qla2xxx modules with rdac.
    • The trick was to do it in /usr/share/systemimager/boot/x86_64/standard/boel_binaries.tar.gz


    2008-06-04 9PM
    Re-imaged all nodes:
    1. Fixed apt sources.list on node07
    2. Took image of node07
    3. Rsync'ed with backup-bioinfo

    4. Re-imaged all the nodes, as described in Cluster Node Replication

    5. /!\ For an undetermined reason node31 did not get re-imaged and went off-line. It does make the first contact with the DHCP server correctly (gets the correct IP).

      • node31 did get re-imaged after restarting systemimager-server-netbootmond. But it is not clear that there is a causal relation between the disappearance of the node31 problem and the restart of systemimager-server-netbootmond.

    2008-06-04 5PM

    Re-imaged Owl:

    1. Took an image of the Biocluster from Biocluster.

    2. Committed changes to SAN:/version-control/alevchuk/owl.git

    3. Fixed a "Second Disk Device" problem! >:> See notes in: Head Node Replication/Biocluster taking an image of itself

    4. Re-imaged Owl. HOWTO is available at Head Node Replication/Re-Imaging Owl from Biocluster

    5. Took the same image again. It has the SystemImager configuration changes.
    6. Committed changes again to SAN:/version-control/alevchuk/owl.git

    7. Re-imaged Owl again, so that the SystemImager changes are synchronized.

    8. Rsync'ed SAN version control and image to bioinfo-backup

    2008-06-04 3PM

    Root's Home file tree documented. Hopefully this will be a standard that is maintained.

    2008-06-04 2PM
    • Node31 eth0 card went out of order. I re-connected the Ethernet cable to eth1 and made proper adjustments to the DHCP server.
    2008-06-04 2PM
    Cluster nodes re-imaged:
    1. Backed up version-control and images from the SAN to bioinfo-backup.

    2. On node07 added OpenSM to the biocluster-infiniband init script

    3. Took an image of node07. Saved in biocluster:/var/lib/systemimager/images/golden

    4. Committed the changes to /root/version_control/alevchuk/golden.git
    5. Re-imaged all 31 nodes (one node is in Georgia state)
    6. Rsync'ed the SAN with the new golden image and repository
    7. Rsync'ed all the images and version-control repositories to bioinfo-backup

    2008-06-04 2PM
    Changed root password on the backup server.

    2008-06-03 6PM
    Cluster nodes re-image:
    1. Re-imaged nodes [4..12] (4 through 12 inclusive)
    2. Rsync'd node01:/scratch to node02:/scratch (took 10 seconds)
    3. Rsync'd node01:/scratch to node04:/scratch (sudo rsync -a --delete /scratch/* node04:/scratch/)

    4. Rsync'd node02:/scratch to node05:/scratch
    5. TODO: Rsync'd new golden image and version control from Biocluster to node04:/scratch

    6. Re-imaged nodes [1..3] and [6..32].
    7. Lost the contents of /scratch on the cluster nodes because of a wrong IP allocation :( (possibly DHCPD was using a range, or I misunderstood something about the SystemImager static IP assignment)


    2008-06-03 1AM
    Head node re-image:
    1. Rsync'ed old image and repository from node01:/scratch to node02:/scratch and to biocluster:/lib/rw/init
    2. Updated the image with si_getimage --golden-client owl

    3. Committed the changes to repository
    4. Rsync'ed the new image and repository back to node01
    5. Rsync'ed the new image to Owl

    6. Re-imaged Biocluster.

    2008-06-02 11PM

    Updated the Biocluster Infiniband script. See Init Scripts


    2008-05-30 5PM

    Finished writing and testing the Biocluster network-configuration script for the head nodes. See Init Scripts


    2008-05-28 5PM

    Got rid of the "Miscellaneous Error: host not found in database", presumably because of the removal of search ucr.edu from resolv.conf. My principal alevchuk@BIOCLUSTER does not get "incorrect password" errors for some reason.

    2008-05-28

    Working on the Storage Device Naming Problem

    2008-05-28
    Owl is now publicly accessible. A DNS Record has been requested.

    2008-05-20

    A new e1000 Intel PCI card installed on Owl. It is dedicated to one purpose: system imaging.


    2008-05-15 8PM

    The built-in IGB network cards that come in both the head node (Biocluster) and the himem/backup node (Owl) do not get recognized by the default 2.6.24 Kernel so Kevin and I had to install an old network card of mine to do the imaging of Biocluster onto Owl.

    2008-05-15 10AM

    There was a domain name change. Bioinfo-new is now Biocluster.


    2008-05-14 12PM

    IPoverIB is working on all the nodes. node01, node02, ... are now the host names for the Infiniband subnet, you can see this from a much better ping. The host names for the ethernets are now node01eth, node02eth, .... Just add the "eth" ending, without any delimiters (e.g. dashes).
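
A small sketch of the naming rule and of the ping comparison mentioned above. The helper eth_name is hypothetical, and the ping commands are printed rather than executed.

```shell
#!/bin/bash
# Sketch only: eth_name is a hypothetical helper illustrating the naming rule.
# It appends "eth" to an Infiniband host name, with no delimiter.
eth_name() {
    printf '%seth\n' "$1"
}

# Print the two ping commands to compare; run them by hand on the cluster.
echo "ping -c 3 node01"
echo "ping -c 3 $(eth_name node01)"
```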

    2008-05-14 12PM

    MPI is working on all of the nodes. Use mpicc and mpirun to try it out. Root access is not necessary.


    2008-04-18 5PM

    Documented our sizes of the Partitions for the OS file trees of the head and the cluster nodes.


    2008-04-17 7PM

    Published a list which defines what is a working configuration of the new cluster.


    2008-04-16 8PM

    All nodes re-imaged with the apt-key knowledge of the public key of Aleksander: Key ID is B0ADA76C. The package manager no longer complains about the absence of trust in Alex's packages.

    2008-04-16 7PM

    A log of cluster's Temperature and fan speed changes.

    2008-04-16 3PM

    Wiki's start/stop script was written. See /etc/init.d/moinmoin on bioweb.

    2008-04-16 2PM

    Cluster Wiki Launched. This website will contain the documentation for the new cluster.


    2008-04-15

    Infiniband is down on node06. dmesg | grep Mellanox prints nothing. Re-imaging again did not help.

    • Node 6 has been fixed. :)

    2008-04-15

    All nodes re-imaged with a set of custom Debian packages allowing for a successful Infiniband ping-pong

    2008-04-15

    Our Custom-Debian-Repository is launched. It is hosted on the head node so that the cluster nodes can pull the packages directly without the need of apt-proxy. The proxy works well for the official packages, but we could not get it to work with our old custom repository. At this point there are 10 packages maintained by Aleksandr. All of them are closely related to Infiniband.

    Cluster Network Topology

    UCR Cluster Network.png

    {i} The orange lines are the SAN fiber optic channels. They are not part of the TCP/IP network.

    Vector Graphics UCR Cluster Network.svg