Infrastructure Progress
News
- 2009-11-19
A must-try article: http://hep.kbfi.ee/index.php/IT/KernelTuning
- 2009-11-13
- Updated to libstdc++6 (4.3.2-1.1) from Lenny on headnodes and computenodes
- 2009-07-13
I stopped courier-mta on Owl to see if Thomas stops getting spam. We need to set up an independent mail server.
- 2009-07-13
- Owl froze due to an unknown NFS defect. This happened at 4AM; I restarted Owl at 9:30AM. Some things are becoming clear about the NFS defect: it was probably triggered by this:
owl ib1 errors WARNINGs: packets is 10452405.01 (outside range [:1]).
- 2009-07-07
- Today I created about 5 different containers for various purposes. One command that I always forget is:
sudo dpkg-reconfigure locales # and select en_US.UTF-8 UTF-8
- 2009-07-02
- On node 27 there are R processes in an uninterruptible sleep (D) state:
grep " D " all-processes
node08 : 275 ? D 0:09 \_ [pdflush]
node27 : 5200 ? D 6:55 /opt/R-2.9.0/lib64/R/bin/exec/R
node27 : 30255 ? D 0:00 [R]
node27 : 30320 ? D 0:00 [R]
node27 : 19151 ? D 0:00 [R]
node27 : 19227 ? D 0:00 [R]
node27 : 32241 ? D 0:00 [R]
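The same kind of listing can be produced on a single host without a saved all-processes file. This is a minimal sketch, assuming a procps-style ps that accepts -eo; the function name filter_d_state is hypothetical and is not one of the cluster scripts.

```shell
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Reads ps-style text on stdin, where the second
# column is the process state.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example (local host only):
#   ps -eo pid,stat,time,args | filter_d_state
```

Matching on the STAT column instead of grepping for " D " also catches states with modifiers such as D+ or D<.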
- 2009-06-11
- Some links from the USENIX09 ZFS tutorial:
- 2009-06-11
- Restarting the NFS server on Biocluster takes a very long time. Something is unhealthy about that.
Running /usr/sbin/rpc.mountd and /usr/sbin/rpc.nfsd could un-freeze the nfs-kernel restart init script.
- 2009-05-31
On Biocluster, when eth2 (192.168.2....) is down, many problems occur. Details.
- 2009-05-30
- Fstabs on Biocluster changed to
localhost:/projects /srv/projects nfs4 bg,rw,soft,intr 0 0
localhost:/home /home nfs4 bg,rw,soft,intr 0 0
localhost:/projects/home_girkelab /home_girkelab nfs4 bg,rw,soft,intr 0 0
localhost:/profound/home_bazhlab /home_bazhlab nfs4 bg,rw,soft,intr 0 0
localhost:/profound/home_sladeklab /home_sladeklab nfs4 bg,rw,soft,intr 0 0
- 2009-05-29
Wrote convert-clustalw-to-fasta. The name fully describes its function. Available everywhere on Biocluster.
- 2009-05-28
T-coffee 7.81
sudo install -path=/opt/t-coffee
but nothing was written to /opt/t-coffee, only 2 man files were written to /usr/local (/usr/local/man/man1/mafft{,-homologs}.1)
For details see : T-Coffee-install-help
- 2009-05-27
- On space1, installed lib32gcc1, which was needed for hpasmcli
- 2009-05-18
wget http://www.micans.org/mcl/src/mcl-09-116.tar.gz
tar zxvf mcl-09-116.tar.gz
cd mcl-09-116/
./configure --prefix=/opt/mcl-09-116 --enable-blast
make
make install
--enable-blast is very important
- 2009-05-12
sudo aptitude install lm-sensors
sudo sensors-detect # press enter ~10 times
sudo modprobe <suggested-module>
sudo modprobe <suggested-module>
sudo modprobe <suggested-module>
sudo vi /etc/modules # add the suggested modules
sudo sensors # you should see the temperatures, and possible voltages and fan speeds
#
sudo aptitude install munin munin-node
cd /etc/munin/plugins
sudo ln -s /usr/share/munin/plugins/sensors_ sensors_temp
sudo ln -s /usr/share/munin/plugins/sensors_ sensors_fan # if you have fan readings
sudo ln -s /usr/share/munin/plugins/sensors_ sensors_volt # if you have voltage readings
sudo /etc/init.d/munin-node restart
- Kevin, thank you for this tip
- 2009-05-06
- To free up some space on bioweb, removed a 2005 version of PFAM (/srv/exports/PFAM-20050422112729/)
- 2009-05-06
- space2:/srv/data/NCBI/ (189G) was larger than bioweb:/srv/exports/NCBI/ (158G) because the old sysadmin duplicated the data where symlinks occurred (probably used scp instead of rsync)
- 2009-05-06
- Not being used:
- space2:/srv/data/cellwall (mounted at bioweb:/srv/web/Cellwall_space)
- space2:/srv/data/blast (mounted at bioweb:/srv/web/blast)
- space2:/srv/data/privateBlast (mounted at bioweb:/srv/web/privateBlast)
- 2009-04-27
- To add the Postgres server instrumentation functions, one needs to:
Install postgresql-contrib Debian package
Run psql -U ADMIN_USERNAME postgres < /usr/share/postgresql/8.3/contrib/adminpack.sql
- 2009-04-16
Did the dd if=/dev/zero and the dd of=/dev/null filesystem performance tests on the cluster nodes:
170 (+/-5) Megabytes/second disk read speed
75 (+/-10) Megabytes/second write speed
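The tests above can be sketched as follows. This is only an illustration, not the exact commands that were run: the function names, the block size, and the target file are assumptions. dd prints its throughput figure on stderr when it finishes.

```shell
# Write test: stream zeros into a file and let dd report the throughput.
# $1 = target file, $2 = size in MiB. Use a size larger than the node's RAM,
# otherwise the page cache dominates the result.
dd_write_test() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" && sync
}

# Read test: stream the file back and discard the data.
# $1 = file written by dd_write_test.
dd_read_test() {
    dd if="$1" of=/dev/null bs=1048576
}

# Example (hypothetical path and size):
#   dd_write_test /scratch/ddtest.bin 16384
#   dd_read_test /scratch/ddtest.bin
#   rm /scratch/ddtest.bin
```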
- 2009-04-15
EMBOSS Explorer installed in an Owl container http://emboss.bioweb.ucr.edu
- 2009-04-15
Synced packages on Owl and Biocluster. Try running compare-to-owl on Biocluster
- 2009-04-14
Wrote The Check: /usr/local/bin/check (it looks at the CPU time since the process was started, so it reports the process's lifetime average)
Try running: check
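The lifetime average mentioned above is CPU seconds consumed divided by wall-clock seconds since the process started. The sketch below only illustrates that arithmetic; it is not the contents of /usr/local/bin/check, and the function name lifetime_cpu_pct is hypothetical.

```shell
# Lifetime-average CPU usage in percent.
# $1 = CPU seconds consumed by the process
# $2 = wall-clock seconds since the process started
lifetime_cpu_pct() {
    awk -v cpu="$1" -v elapsed="$2" 'BEGIN {
        if (elapsed <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * cpu / elapsed
    }'
}

# Example: a process that used 30 CPU seconds over 60 seconds of lifetime
#   lifetime_cpu_pct 30 60
```

The procps ps reports the same ratio in its pcpu column, which is why a long-running process that only recently became busy still shows a low percentage.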
- 2009-04-13
- The freeze of Munin was due to owl being down. After commenting it out in /etc/munin/munin.conf and restarting /etc/init.d/munin-node, everything started working again.
The first error in /var/log/munin/munin-graph.log was
Apr 11 21:29:44 - Unable to graph /var/lib/munin/headnodes/owl.headnodes-if_eth0-down-c.rrd: invalid rpn expression in: idown,adown,-
The best way to diagnose Munin is still:
sudo tail -f /var/log/munin/*.log | grep -iE '(unable)|(err)|(warn)'
- 2009-03-28
- Re-imaged nodes 7, 9, 18, 19, 23, 24, 25, 26 and ran ldconfig on all the other nodes. This concludes the Amber "ptraj" fix for Qaiser
- 2009-03-28
- Upgraded OpenMPI to v1.3.1 (2009-03-19) in biocluster:/usr/local
- 2009-02-17
Launched an OpenVZ container with Apache for serving the user homes because .htaccess functionality was needed. Look for "apache" in /etc/lighttpd/front.conf and ssh apache-homes
- 2009-01-27
Created a Debian package for antiexcel 0.1.1 and added it to our Custom-Debian-Repository
- 2009-01-27
- The NAT on owl should be
sudo iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DNAT tcp -- anywhere swiki tcp dpt:www to:192.168.3.221:8080
DNAT tcp -- anywhere wiki tcp dpt:www to:192.168.3.222:8080
DNAT tcp -- anywhere iwiki tcp dpt:www to:192.168.3.223:8080
DNAT tcp -- anywhere manual tcp dpt:www to:192.168.3.224:8080
sudo iptables -t nat -A PREROUTING -d 192.168.3.224/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 192.168.3.224:8080
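If the table ever has to be rebuilt, the rules can be generated rather than typed. The sketch below prints one rule per container address, following the pattern of the single rule above. It is an assumption that the other three containers use their own 192.168.3.x address as the -d match in the same way; the function name print_dnat_rules is hypothetical. Nothing is executed, the commands are only printed for review.

```shell
# Print (do not run) one DNAT rule per container address: traffic to port 80
# on the address is redirected to port 8080 on the same address.
print_dnat_rules() {
    for ip in "$@"; do
        printf 'iptables -t nat -A PREROUTING -d %s/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination %s:8080\n' "$ip" "$ip"
    done
}

# Example:
#   print_dnat_rules 192.168.3.221 192.168.3.222 192.168.3.223 192.168.3.224
```

After reviewing the output, each line can be run with sudo and the result checked with sudo iptables -L -t nat.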
- 2009-01-27
- The load growth on Owl started around Noon of the 26th, when we were doing load experiments on the NFS server:
- The host went down with
Jan 26 12:35:05 owl kernel: nfs: server biocluster not responding, timed out
Jan 26 12:36:05 owl kernel: nfs: server biocluster not responding, timed out
Jan 26 12:37:05 owl kernel: nfs: server biocluster not responding, still trying
Jan 26 12:38:05 owl kernel: INFO: task sh:27339 blocked for more than 120 seconds.
Jan 26 12:38:05 owl kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 12:38:05 owl kernel: sh D ffff8109d380f090 0 27339 27337
Jan 26 12:38:05 owl kernel: ffff810ba39298f8 0000000000000082 ffff810ba39298c0 0000000000000000
Jan 26 12:38:05 owl kernel: 0000000000000000 ffff810ba39298a8 0000000000000000 ffffffff805b9200
Jan 26 12:38:05 owl kernel: ffffffff805b9200 ffffffff805b9200 ffffffff805b9200 ffffffff805b9200
Jan 26 12:38:05 owl kernel: Call Trace:
Jan 26 12:38:05 owl kernel: [<ffffffffa034db76>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
Jan 26 12:38:05 owl kernel: [<ffffffffa034dba0>] :sunrpc:rpc_wait_bit_killable+0x2a/0x31
Jan 26 12:38:05 owl kernel: [<ffffffff80415158>] __wait_on_bit+0x40/0x6e
Jan 26 12:38:05 owl kernel: [<ffffffffa034db76>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
Jan 26 12:38:05 owl kernel: [<ffffffff804151f2>] out_of_line_wait_on_bit+0x6c/0x78
Jan 26 12:38:05 owl kernel: [<ffffffff80243fe4>] wake_bit_function+0x0/0x23
Jan 26 12:38:05 owl kernel: [<ffffffffa034a62d>] :sunrpc:xprt_connect+0x89/0x124
Jan 26 12:38:05 owl kernel: [<ffffffffa034e1a2>] :sunrpc:__rpc_execute+0x139/0x2af
Jan 26 12:38:05 owl kernel: [<ffffffffa03475a7>] :sunrpc:rpc_run_task+0x5a/0x62
Jan 26 12:38:05 owl kernel: [<ffffffffa0347644>] :sunrpc:rpc_call_sync+0x3e/0x5b
Jan 26 12:38:05 owl kernel: [<ffffffffa03a7c69>] :nfs:nfs4_proc_access+0x147/0x1ca
Jan 26 12:38:05 owl kernel: [<ffffffff8027197b>] __rmqueue_smallest+0x88/0x10a
Jan 26 12:38:05 owl kernel: [<ffffffff80271a18>] __rmqueue+0x1b/0x1c1
Jan 26 12:38:05 owl kernel: [<ffffffff80271c04>] rmqueue_bulk+0x46/0x8f
Jan 26 12:38:05 owl kernel: [<ffffffff802798b5>] zone_statistics+0x3c/0x90
Jan 26 12:38:05 owl kernel: [<ffffffff80273137>] get_page_from_freelist+0x48f/0x60d
Jan 26 12:38:05 owl kernel: [<ffffffffa039224a>] :nfs:nfs_do_access+0x161/0x30a
Jan 26 12:38:05 owl kernel: [<ffffffffa03924df>] :nfs:nfs_permission+0xec/0x15b
Jan 26 12:38:05 owl kernel: [<ffffffff8029eb78>] permission+0xa9/0xf4
Jan 26 12:38:05 owl kernel: [<ffffffff8029ffdb>] __link_path_walk+0x143/0xdda
Jan 26 12:38:05 owl kernel: [<ffffffff802a0cb8>] path_walk+0x46/0x8b
Jan 26 12:38:05 owl kernel: [<ffffffff802a0fe2>] do_path_lookup+0x154/0x1ce
Jan 26 12:38:05 owl kernel: [<ffffffff802a1af4>] __path_lookup_intent_open+0x56/0x97
Jan 26 12:38:05 owl kernel: [<ffffffff8029b23a>] open_exec+0x24/0xb2
Jan 26 12:38:05 owl kernel: [<ffffffff8027e063>] handle_mm_fault+0x3db/0x829
Jan 26 12:38:05 owl kernel: [<ffffffff80280ec2>] vma_merge+0x141/0x1ee
Jan 26 12:38:05 owl kernel: [<ffffffff8029c2f8>] do_execve+0x74/0x20f
Jan 26 12:38:05 owl kernel: [<ffffffff8020a47f>] sys_execve+0x35/0x4c
Jan 26 12:38:05 owl kernel: [<ffffffff8020c2ca>] stub_execve+0x6a/0xc0
Fixed by rebooting.
- 2009-01-26
- Set the IO scheduling of /home back to completely fair queuing.
sudo su -c "echo cfq > /sys/block/sdc/queue/scheduler"
A good way to see all the IO scheduling policies is: tail /sys/block/sd*/queue/scheduler
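Each scheduler file lists the available policies on one line and marks the active one with square brackets, for example "noop anticipatory deadline [cfq]". A small sketch for pulling out just the active policy; the function name active_scheduler is hypothetical:

```shell
# Extract the active policy (the bracketed name) from the contents of a
# /sys/block/<dev>/queue/scheduler file read on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example:
#   active_scheduler < /sys/block/sdc/queue/scheduler
```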
- 2009-01-20
Re-imaging node07 and node10.
- 2009-01-16
Disabling kernel randomization of the stack pointer on node32 (its image will then be replicated to the rest of the nodes). This is in response to the warning in the NAMD output. See Worker-Node-Kernel-Configuration for details.
- 2009-01-15
Removed /usr/local/sbin/munin-generate-aggrigations and all the related junk from /etc/munin/munin.conf because this and this are the new generation of graphs.
- 2009-01-15
node21 got InfiniBand connection down. Same as the other cases. Caused by NAMD crash. Interestingly the namd processes were still running, but were impossible to terminate.
- 2009-01-13
node12 got InfiniBand connection down. Same as the other cases. Caused by NAMD crash.
- 2009-01-10
node07 still appeared down in qnodes -l. Fixed by restarting pbs_mom
- 2009-01-08 (night)
node07 got InfiniBand connection down. Same.
- 2009-01-07
node16 got InfiniBand connection down. Fixed by rebooting. Syslog had a bunch of the usual:
Jan 7 23:07:54 node16 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Jan 7 23:07:54 node16 kernel: ib1: transmit timeout: latency 4126424 msecs
Jan 7 23:07:54 node16 kernel: ib1: queue stopped 1, tx_head 877748, tx_tail 877684
and
Jan 7 23:26:00 node16 kernel: mlx4_core 0000:01:00.0: SW2HW_MPT failed (-16)
Jan 7 23:26:20 node16 kernel: mlx4_core 0000:01:00.0: HW2SW_MPT failed (-16)
Jan 7 23:26:40 node16 kernel: mlx4_core 0000:01:00.0: SW2HW_MPT failed (-16)
- 2009-01-07
Implemented IO-by-User Monitoring on node01
- 2009-01-07
Documented the Users-Master-List
- 2009-01-07
In biocluster:/usr/local created commit 9f561e7: Implemented a nice timeout tool: run-with-timeout
- 2009-01-06
Installed iotop on all nodes
- 2009-01-06
node11 got InfiniBand connection down. Fixed by restarting.
- 2009-01-05
Recorded all Verari Service Numbers
- 2009-01-05
- Node11 lost IB temporarily.
tail -f /var/log/syslog
Jan 5 17:52:05 node11 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Jan 5 17:52:05 node11 kernel: ib1: transmit timeout: latency 3996704 msecs
Jan 5 17:52:05 node11 kernel: ib1: queue stopped 1, tx_head 302521, tx_tail 302457
- 2009-01-03
- Node01 is still invisible to the VSM.
- 2009-01-03
- Restored the VCC Database to the 2008-12-01 snapshot because a handful of nodes in shelf 1 became invisible to the VSM.
- 2008-12-20
node28 has IB packet loss
- 2008-12-30
node07 and node20 have their InfiniBand connections down. They were both running Quaiser's tasks. Fixed by restarting.
- In syslog both nodes are printing the following once per second with slightly different values:
kernel: NETDEV WATCHDOG: ib1: transmit timed out
kernel: ib1: transmit timeout: latency 420676572 msecs
kernel: ib1: queue stopped 1, tx_head 242070168, tx_tail 242070104
- 2008-12-30
Backup ran out of space, because a large number of files were copied from /home to /profound and then deleted from /home for mbazhenov.
- 2008-12-29
I accidentally gave Owl the IP of Biocluster (placed its MAC address as the identifier of Biocluster in /etc/biocluster). This started kicking users out of Biocluster and placing them into Owl when they tried to log in.
- 2008-12-23
- From LSI SANtricity software
Drive at Tray 0, Slot 5
Status: Failed
Mode: Assigned
Raw capacity: 279.397 GB
Usable capacity: 278.897 GB
World-wide identifier: 20:00:00:1d:38:58:48:ca:00:00:00:00:00:00:00:00
Associated volume group: 1
Port 0; Channel 1; ID 35/0xAD
Port 1; Channel 2; ID 18/0xCB
Drive path redundancy: OK
Drive type: Fibre Channel
Speed: 15015 RPM
Current data rate: 4 Gbps
Product ID: ST3300655FC
Firmware version: MS08
Serial number: 3LM3MWHP00009828SWN9
Vendor: SEAGATE
Date of manufacture: February 21, 2008
- 2008-12-23
- Owl had a problem when shutting down. IB's stop rc script runs before NFS stop rc script, so NFS freezes and halts the entire shutdown sequence.
- 2008-12-22
- Node09 does not respond to ping on the IB network. Eth network is fine.
Looks like a node that dies after a failed NAMD submission. Fixed by rebooting.
Also had to restart pbs_mom after booting.
- 2008-12-21
- Node28 does not respond to ping on the IB network. Eth network is fine.
Looks like a node that dies after a failed NAMD submission. Fixed by rebooting and placed back into queuing system.
- 2008-12-19
- Node26 and Node19 have serious ping problems.
- 2008-12-19
- Node26 lost NFS.
- 2008-12-19
Nodes 14, 15, 16, 17 (possibly), and 18 (less possibly) had qstat status E for the tasks that were on them. This is because I temporarily disconnected them from the IB network (did not suspect that Torque would be that sensitive).
- 2008-12-17
Node27 was not responding (even to keyboard and monitor). Showed I/O errors on sda. After restart, the BIOS could not locate the hard drive. I opened the node and swapped the SATA cable that was in the drive with the other free one that was in the node. It is now working, but I had to restart pbs_mom to get the qnodes state to change from DOWN to FREE.
- 2008-12-16
Restarted pbs_mom on node21 to get it to change state (per qnodes) from DOWN to FREE.
- 2008-12-15
In /etc/munin/plugin-conf.d/munin-node the list of users for head node CPU accounting is set to
- afatmi
- alevchuk
- ebolotin
- lgao
- qwu
- root
- tgirke
- xpcui
- 2008-12-15
/etc/munin/plugins/cpu and /etc/munin/plugins/cpubyuser were updated to adjust how the graph is displayed.
- 2008-12-15
Started using Munin to perform the quota monitoring. Thomas and I are on the notification list. The standard quota package is only used as an accounting mechanism for quota; it does not have any soft or hard limits. The equivalent of "soft" limits is set in /etc/munin/munin.conf, but it only sends out notifications and does not have any grace period.
- 2008-12-15
- On the Dell workstations (core3, core4, batch2145d, keenhall1008), blocked the monitoring system's warnings for the S.M.A.R.T. values, because they seem to be irrelevant and the following command reports PASSED
sudo smartctl -H /dev/hda3
- 2008-12-15
- On Biocluster, replaced vim-tiny with the full version:
[INSTALL] vim-full
[INSTALL] vim-latexsuite
[INSTALL] vim-lesstif
[INSTALL] vim-perl
[INSTALL] vim-python
[INSTALL] vim-ruby
[INSTALL] vim-scripts
[INSTALL] vim-tcl
[REMOVE] vim-tiny
- 2008-12-15
On Biocluster, did Python 2.4 and Perl 5.8 minor Debian updates with aptitude:
[UPGRADE] libperl5.8 5.8.8-7etch3 -> 5.8.8-7etch5
[UPGRADE] perl 5.8.8-7etch3 -> 5.8.8-7etch5
[UPGRADE] perl-base 5.8.8-7etch3 -> 5.8.8-7etch5
[UPGRADE] perl-doc 5.8.8-7etch3 -> 5.8.8-7etch5
[UPGRADE] perl-modules 5.8.8-7etch3 -> 5.8.8-7etch5
[UPGRADE] python2.4 2.4.4-3+etch1 -> 2.4.4-3+etch2
[UPGRADE] python2.4-dev 2.4.4-3+etch1 -> 2.4.4-3+etch2
[UPGRADE] python2.4-minimal 2.4.4-3+etch1 -> 2.4.4-3+etch2
- 2008-12-15
- Backed up the Wiki container.
- 2008-12-9
The middle shelf on the Biocluster's SAN shows a
(exclamation mark in a triangle)
- 2008-12-9
Looks like node21 was knocked out by NAMD. Rebooting via the Verari web control panel.
- 2008-12-8
Installed /home/khoran/downloads/pbzip2_1.0.3-1_amd64.deb on Biocluster, node02, and node32. This is a parallel bzip compressor.
- 2008-12-6
- Undocumented command from the past (access to the disk IO scheduler policy):
sudo cat /sys/block/sd{a,b,c,d}/queue/scheduler
- 2008-11-27
Wrote the namd-kill-all-jobs script. Its ancestor is namd-kill-job. Both scripts are located in /usr/local/bin/
- 2008-11-26
First short-term TODO list: TODO List 1a
- 2008-11-20
In the last 7 days there were some changes, problems, and solutions related to NFS.
- 2008-11-18
For a still unknown reason, OpenVZ checkpointing stopped working correctly after the recent reboot of Owl
- 2008-11-18
- The number of NFS threads was increased from 80 to 160, because all 80 were staying busy.
- After the NFS restart, the number of busy threads went up to 134. The current "load" of the system is 143.
- All compute nodes are still very slow on IO system calls, but fortunately no "Stale NFS handles" are observed.
- After about one hour, the number of busy NFS threads dropped to 0.
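The configured thread count can be read back from the kernel after a restart. This is a sketch under the assumption that /proc/net/rpc/nfsd has the layout used by kernels of this era, where the line starting with "th" carries the number of nfsd threads as its first value; newer kernels may differ. The function name nfsd_thread_count is hypothetical.

```shell
# Print the number of nfsd threads from /proc/net/rpc/nfsd-style text on stdin.
# Assumes the "th" line has the thread count as its second field.
nfsd_thread_count() {
    awk '$1 == "th" { print $2 }'
}

# Example:
#   nfsd_thread_count < /proc/net/rpc/nfsd
```

On Debian the count is typically set through RPCNFSDCOUNT in /etc/default/nfs-kernel-server, so the value printed here should match that setting once the server has been restarted.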
- 2008-11-18
Nodes 4-to-23 and 27-to-32 were re-imaged. All compute-nodes except node12, node13, and node19 are available for task submission (using qsub).
- 2008-11-12
Installed all the software listed at http://molpopgen.org/software/lseqsoftware.html, which uses libsequence (also installed recently)
- 2008-11-01
On the backups host, /var usage spikes stopped after I added PRUNEPATHS="/srv /var/lib/vz" to /etc/updatedb.conf. Maybe this should be re-enabled later so that users can search their backups.
- 2008-10-23
- Approx Debian mirror changed
- 2008-10-21
- Biocluster had symptoms similar to the Oct 17th crash, but I prevented the crash by terminating the process that was using 98% of the memory. The process was an R script that one of our researchers was running.
- 2008-10-17
Biocluster 2008-10 Crash soon after I re-imaged all nodes.
- 2008-09
Found the OpenSM problem
- 2008-09-03
- Uninstalled php4-cgi and php4-common from Biocluster
- 2008-09-03
Created the namd-kill-job and namd-start. See Using NAMD.
- 2008-09-02
Created the ping-ibs script. See Pinging All Infiniband Cards.
- 2008-08-27
Work related bookmarks of Aleksandr Levchuk can be found here: http://ihooh.com/tags/?id=75146483
- 2008-08-24
- After rebooting all the nodes, node27 did not boot. After rebooting 2 times it went back online.
- 2008-08-23
- The Verari Service Module will power down nodes even if you say POWER UP on nodes that are already booted. I learned it the hard way.
- 2008-08-22
- One broken RAM stick can fool you into replacing the motherboard. In some slots a broken RAM stick can stall the Verari nodes so that they never reach the BIOS. A broken RAM stick in other slots does not cause this failure. I could not figure out exactly which slots cause this, but the every-4th blue slot does not do the trick. Placing the broken stick into 2 nodes (3 different motherboards) made it seem that the killer slot is somewhere around the second blue slot (counting from the front panel of the node).
- 2008-08-19
Discovered that in order for ulimit to take effect for all users, it must be set by root before su-ing into a user. In other words, set it in the script that launches pbs_mom.
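The reason this works is that resource limits are inherited by child processes, so anything started from a shell that has already called ulimit runs under that limit. A minimal sketch of the idea, using the open-files limit as the example; the function name run_with_nofile and the value 64 are hypothetical, and this is not the actual pbs_mom launch script.

```shell
# Run a command with the open-files limit set first. The limit is applied in
# a subshell, so the caller's own limits stay untouched, and the command
# started by exec inherits it.
# $1 = limit, remaining arguments = command to run
run_with_nofile() {
    limit=$1
    shift
    ( ulimit -n "$limit" && exec "$@" )
}

# Example: the child shell reports the limit it inherited
#   run_with_nofile 64 sh -c 'ulimit -n'
```

In an init script the same pattern is simply a ulimit line placed before the line that starts the daemon.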
- 2008-08-15
- Changed /home from XFS to EXT3.
- 2008-08-06
Documented the Creating a New Container procedure
- 2008-08-02
Documented the Making a Template from a Container procedure
- 2008-08-01
Documented the Version Controlling the Biobuntu Template procedure
- 2008-08-01
Wrote the wget-to-destination script. It is good for putting files into containers that do not have their own Internet.
- 2008-07-30
- I found an interesting behavior of Bash:
# 1. `echo; sleep 10` #(include the back-ticks)
# 2. Press C-z.
This will lock you out of your shell. The normal C-c will not work. The only way to get out of it is to run a kill from another shell.
Tested on Debian 4 and RedHat 5.
- 2008-07-30
- The Perl's package manager CPAN upgraded to script version 1.9, CPAN.pm version 1.9205
- 2008-07-28
Some nodes still have not been re-imaged. Check with on-all-nodes-run sudo aptitude show libc6 \| grep Version
- 2008-07-22
On Biocluster the files in .html/ in homes are now served as static web content. For example http://biocluster.ucr.edu/~tgirke/
- 2008-07-18
Node26 is down due to packet loss in the InfiniBand link.
- 2008-07-22
Fixed
- 2008-07-17
There is a hard-to-reproduce problem with /etc/init.d/biocluster-infiniband on the compute nodes. The script adds duplicate ib1 entries to /etc/network/interfaces. Although I re-imaged and re-booted about 20 nodes at different times, this problem happened only twice in that session.
- 2008-07-16
Looks like the NTP server needs to be stopped before you can run sudo ntpdate-debian; otherwise you will get the error "the NTP socket is in use, exiting". So, Biocluster is our only NTP server. This will be OK for the projected implementation of the Owl fail-over in case of hardware failure in Biocluster, because Owl will become an absolute replacement and will launch the NTP server.
- 2008-07-12
- 2008-07-08
Figured out why my script on-all-nodes-run was waiting for some nodes indefinitely. Re-imaging all free nodes to fix the problem. Manually fixing the busy nodes 1, 2, 4, 5, and 14.
See SSH Fails to Close Connection for the description of the problem.
- 2008-07-06
While Kerberos is working correctly on all other nodes, Node07 started asking for a password when SSHing from Biocluster. on-all-nodes-run was working fine on all nodes (except 09 and 30) on Thursday.
Figured out: auth.log: Unknown code krb5 37 - This means Clock skew too great
See Kerberos Codes
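Kerberos rejects tickets when the clocks of the two hosts differ by more than the configured tolerance, which is usually 300 seconds by default. A quick way to compare clocks before digging through auth.log is sketched below; the function name clock_skew_ok is hypothetical, and the 300-second default is an assumption about the local krb5 configuration.

```shell
# Succeeds when two epoch timestamps differ by no more than the tolerance.
# $1, $2 = epoch seconds from the two hosts
# $3     = tolerance in seconds (default 300)
clock_skew_ok() {
    a=$1 b=$2 max=${3:-300}
    diff=$(( a - b ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    [ "$diff" -le "$max" ]
}

# Example (hypothetical host name):
#   clock_skew_ok "$(date +%s)" "$(ssh node07 date +%s)" || echo "clock skew too great"
```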
- 2008-07-03
- Network switched to the 10 Gigabit Ethernet in the Data Center
- 2008-07-01
Documented the Resetting Torque system administration procedure.
- 2008-07-01
Documented the Reserving Nodes system administration procedure.
- 2008-07-01
All nodes (except node27 and node28) re-imaged and ready for centralized software installation. To install software please follow the procedure documented in Software in Local.
- 2008-06-30
qsub and all the other torque operations became available on the worker nodes. It just happened as a side effect after centralizing /usr/local
- 2008-06-30
- Fixed Ethernet connections for node28 and node31. On both nodes the eth0 ports are broken. Had to reconnect the cables to eth1, update the hardware MAC addresses in /etc/dhcp3/dhcpd.conf, and re-image the nodes.
- 2008-06-27
Fixed Torque by installing it from source rather than as a package. This fixes xpbsmon. Made the jobs of other users visible with qstat. For details see Queuing with Torque
- 2008-06-25
Installed the blast2 package on all the nodes. Fixed the permissions on the /scratch partition.
- 2008-06-24
This afternoon we went into production. The following Announcement was sent out to all our users.
- 2008-06-20 8AM
Reverted bioinfo back to what it was. There is an issue with mounting Biocluster's fs on bioinfo's nodes.
- 2008-06-20 8AM
Re-configured bioinfo so that it mounts the home of Biocluster
- 2008-06-20 8AM
On bioinfo, moved /srv/exports/home to /srv/exports/home-bioinfo
- 2008-06-20 7AM
Brought down NFS server on bioinfo
- 2008-06-20 2AM
Biocluster re-imaged. Came up with the following issues:
SAN not mounted
What is really needed is a good fstab for Biocluster. See below. Fixed Temporarily!
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/sda2 / xfs defaults 0 1
/dev/sda1 /boot ext3 defaults 0 2
/dev/sda7 /tmp xfs defaults 0 2
/dev/sda5 /usr xfs defaults 0 2
/dev/sda6 /var xfs defaults 0 2
/dev/sda8 /scratch xfs defaults 0 2
LABEL="san/projects" /srv/projects xfs defaults 0 2
LABEL="san/homes" /home xfs defaults 0 2
LABEL="san/profound" /profound xfs defaults 0 2
/srv/projects /srv/nfs4exports/projects none rw,bind 0 0
/home /srv/nfs4exports/home none rw,bind 0 0
/profound /srv/nfs4exports/profound none rw,bind 0 0
/dev/sda9 none swap sw 0 0
/dev/hdb /media/cdrom0 udf,iso9660 user,noauto 0 0
KDC down
the init scripts krb5-kdc and krb5-admin-server were looking for the old /usr/local binaries. Fixed Permanently!
biocluster172 hostname
This is a result of systemimager. That's why completing /etc/init.d/biocluster-network-config is important. The final hostname must be biocluster.ucr.edu. Fixed Temporarily!
Never appeared in /var/lib/systemimager/clients.xml
I'm not sure what happened there. I thought I configured the monitoring correctly.
- 2008-06-19 6PM
Owl re-imaged. The /home, /profound, and /srv/projects NFS mounts were not present in /etc/fstab, of course.
- 2008-06-19 12PM
Fixed the "Name to ID mapping"/"Large Groups" problem by upgrading to libnfsidmap-0.20. The upgrade was done invisibly to the package manager, by simply overwriting /usr/lib/libnfsidmap.so.0.2.0 on the worker nodes and the head nodes (the older version was 40008 bytes; the new one is 44240 bytes). We did this because we are anticipating the same fix to occur in the upgrade to Lenny. See: /root/src/alevchuk/second-quarter/ and Users and Groups
- 2008-06-18 5PM
I was going to, but did not, obviate the "Name to ID mapping problem" in NFSv4. See Users and Groups for details. Also see my mailing list follow-up http://linux-nfs.org/pipermail/nfsv4/2008-June/008814.html
- 2008-06-18 9AM
Question about the NFSv4 "nogroup" problem posted: http://linux-nfs.org/pipermail/nfsv4/2008-June/008803.html
- 2008-06-17 3PM
Updating /etc/passwd and /etc/group on Biocluster and Owl.
- 2008-06-17 12PM
After changing the hostname of Biocluster from biocluster.ucr.edu to biocluster Kerberos SSH connection stopped working. The /var/log/auth.log says debug1: Got no client credentials and the connection gets closed immediately. This could be a hard to find problem.
- 2008-06-16 9PM
node09 fails to bring up Infiniband interface. sudo cat /var/log/dmesg | grep Mel gives empty output.
- 2008-06-16 9PM
qsub -I does not work because pbs_mom on the worker nodes cannot open a connection back to pbs_server on the head node. This happens because the client tries to connect to biocluster.ucr.edu.
I verified that adding biocluster.ucr.edu at the end of the 192.168.3.21 line in /etc/hosts on one of the worker nodes fixes this issue.
I tried to fix it without having the ucr.edu entry on the internal network by changing /var/spool/torque/server_name from localhost to biocluster. This was meant to test the hypothesis that pbs_mom on the worker nodes is trying biocluster.ucr.edu because /var/spool/torque/server_name is localhost, which resolves to the loopback device, which then somehow gets routed to 138.23.201.83, which resolves to biocluster.ucr.edu. The change did not have the anticipated effect. The hypothesis was not confirmed.
- 2008-06-16 9PM
On the worker nodes syslogs are flooded with error messages. OpenSM daemons are not initialized on the correct ports. This happens because the daemon is launched before the Infiniband network is up and it binds to a wrong port.
Fixed by making a separate init script
Needs to be documented in Init Scripts
- 2008-06-13 4PM
Ran make uninstall for Maui, this removed some ".a" and ".h" files, but not everything. I also moved /var/spool/maui/ and /usr/local/maui/ to trash.
- 2008-06-13 4PM
Torque started working after launching pbs_sched. But not immediately after :/.
- 2008-06-12 9PM
Downloaded a scheduler for Torque, called Maui. Had to register my email
Direct link: http://www.clusterresources.com/downloads/maui/maui-3.2.6p19.tar.gz
- 2008-06-12 8PM
Installed a lot of packages. Mostly compilers.
- 2008-06-12 6PM
Updated /etc/profile with some pieces of code from /etc/skel/.bashrc; modified /etc/bash.bashrc for limitless ~/.bash_history growth without letting the users easily turn that off.
- 2008-06-12 10AM
Found a bug.
The admins group was missing. Re-imaging all the worker nodes... done! 12:30 PM
- 2008-06-12 3AM
Re-imaged all worker nodes with the updated /etc/passwd and /etc/group
- 2008-06-12 12PM
Updated /etc/passwd and /etc/group with the data from bioinfo LDAP. See Users and Groups for details.
- 2008-06-11 2PM
Setting up NFSv4 according to wiki.linux-nfs.org and www.crazysquirrel.com
- 2008-06-11 2PM
alevchuk@node02:~$ time ssh node01 ssh node02 ssh node03 ssh node04 ssh node05 ssh node06 ssh node07 ssh node08 ssh node10 ssh node11 ssh node12 ssh node13 ssh node14 ssh node15 ssh node16 ssh node17 ssh node18 ssh node19 ssh node20 ssh node21 ssh node22 ssh node23 ssh node24 ssh node25 ssh node26 ssh node27 ssh node28 ssh node29 ssh node31 ssh node32 hostname
node32
real 0m22.146s
user 0m0.012s
sys 0m0.000s
This creates a chain of kerberized SSH connections from node n to n + 1 covering all of our functional nodes. The hostname of node32 is passed back to the origin through the entire chain.
- Isn't that lovely?
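The chained command above does not need to be typed by hand. A small sketch that builds it from a list of node names; the function name build_ssh_chain is hypothetical, and the command is only printed, not run:

```shell
# Build a chained ssh command that hops through every node given as an
# argument, in order, and runs hostname on the last one.
build_ssh_chain() {
    chain=""
    for node in "$@"; do
        chain="${chain}ssh ${node} "
    done
    printf '%shostname\n' "$chain"
}

# Example:
#   build_ssh_chain node01 node02 node03
```

Nodes that are down are simply left out of the argument list, as node09 and node30 were in the run above.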
- 2008-06-11 12PM
node09 fails to bring up Infiniband interface. Possibly a hardware problem.
- 2008-06-11 12PM
Created an official known_hosts file for the worker nodes. Added to /etc/skel
- 2008-06-10 7PM
Updated the biocluster-infiniband init script so that it parses /etc/network/interfaces instead of parsing ifconfig output. It then updates /etc/network/interfaces and exits, just like biocluster-networking-config on Biocluster.
- 2008-06-09 11PM
Created the Saving Disk Space page.
- 2008-06-06 11PM
I got the GSSAPI ssh to handle both the Workstation -> Biocluster and Biocluster -> node31 kerberized connections. :_)
- 2008-06-05 4PM
Took out scripts to the SAN, so now there are both si_images and si_scripts. Rsync'd to backup.
- 2008-06-05 4PM
Re-imaged Owl while connected to the SAN.
- 2008-06-05 2PM
- Fixed the SAN while re-imaging problem by disabling the qla2xxx modules with rdac.
The trick was to do it in /usr/share/systemimager/boot/x86_64/standard/boel_binaries.tar.gz
- 2008-06-04 9PM
- Re-imaged all nodes:
- Fixed apt sources.list on node07
- Took image of node07
Rsync'ed with backup-bioinfo
Re-imaged all the nodes, as described in Cluster Node Replication
For an undetermined reason node31 did not get re-imaged and went off-line. It does make the first contact with the DHCP server correctly (gets the correct IP). node31 did get re-imaged after restarting systemimager-server-netbootmond, but it is not clear that there is a causal relation between the disappearance of the node31 problem and the restart of systemimager-server-netbootmond.
- 2008-06-04 5PM
Re-imaged Owl:
Took an image of the Biocluster from Biocluster.
Committed changes to SAN:/version-control/alevchuk/owl.git
Fixed a "Second Disk Device" problem!
See notes in: Head Node Replication/Biocluster taking an image of itself
Re-imaged Owl. HOWTO is available at Head Node Replication/Re-Imaging Owl from Biocluster
- Took the same image again. It has the SystemImager configuration changes.
Committed changes again to SAN:/version-control/alevchuk/owl.git
Re-imaged Owl again, so that the SystemImager changes are synchronized.
Rsync'ed SAN version control and image to bioinfo-backup
- 2008-06-04 3PM
Root's Home file tree documented. Hopefully this will be a standard that is maintained.
- 2008-06-04 2PM
- Node31 eth0 card went out of order. I re-connected the Ethernet cable to eth1 and made the corresponding adjustments to the DHCP server.
- 2008-06-04 2PM
- Cluster nodes re-imaged:
Backed up version-control and images from the SAN to bioinfo-backup.
On node07 added OpenSM to the biocluster-infiniband init script
Took an image on node07. Saved in biocluster:/var/lib/systemimager/images/golden
- Committed the changes to /root/version_control/alevchuk/golden.git
- Re-imaged all 31 nodes (one node is in Georgia state)
- Rsync'ed the SAN with the new golden image and repository
Rsync'ed all the images and version-control repositories to bioinfo-backup
- 2008-06-04 2PM
- Changed root password on the backup server.
- 2008-06-03 6PM
- Cluster nodes re-image:
- Re-imaged nodes [4..12] (4 through 12 inclusive)
- Rsync'd node01:/scratch to node02:/scratch (took 10 seconds)
Rsync'd node01:/scratch to node04:/scratch (sudo rsync -a --delete /scratch/* node04:/scratch/)
- Rsync'd node02:/scratch to node05:/scratch
TODO: Rsync'd new golden image and version control from Biocluster to node04:/scratch
- Re-imaged nodes [1..3] and [6..32].
Lost the contents of /scratch on the cluster nodes because of a wrong IP allocation
(possibly DHCPD was using a range, or I misunderstood something about the SystemImager static IP assignment)
- 2008-06-03 1AM
- Head node re-image:
- Rsync'ed old image and repository from node01:/scratch to node02:/scratch and to biocluster:/lib/rw/init
Updated the image with si_getimage --golden-client owl
- Committed the changes to repository
- Rsync'ed the new image and repository back to node01
Rsync'ed the new image to Owl
Re-imaged Biocluster.
- 2008-06-02 11PM
Updated the Biocluster Infiniband script. See Init Scripts
- 2008-05-30 5PM
Finished writing and testing the Biocluster network-configuration script for the head nodes. See Init Scripts
- 2008-05-28 5PM
Got rid of the "Miscellaneous Error: host not found in database", presumably because of the removal of search ucr.edu from resolv.conf. My principal alevchuk@BIOCLUSTER does not get "incorrect password" errors for some reason.
- 2008-05-28
Working on the Storage Device Naming Problem
- 2008-05-28
- Owl is now publicly accessible. A DNS Record has been requested.
- 2008-05-20
A new e1000 Intel PCI card installed on Owl. It is dedicated to one purpose: system imaging.
- 2008-05-15 8PM
The built-in IGB network cards that come in both the head node (Biocluster) and the himem/backup node (Owl) are not recognized by the default 2.6.24 kernel, so Kevin and I had to install an old network card of mine to do the imaging of Biocluster onto Owl.
- 2008-05-15 10AM
There was a domain name change. Bioinfo-new is now Biocluster.
- 2008-05-14 12PM
IPoverIB is working on all the nodes. node01, node02, ... are now the host names for the Infiniband subnet; you can see this from a much better ping. The host names for the Ethernet interfaces are now node01eth, node02eth, .... Just add the "eth" ending, without any delimiters (e.g. dashes).
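The naming rule is simple enough to script. A small sketch, assuming only the convention described above; eth_name is a hypothetical helper.

```shell
# Hypothetical helper: map an Infiniband host name to its Ethernet counterpart
# by appending "eth" with no delimiter, as described in the entry above.
eth_name() {
    printf '%seth\n' "$1"
}

# To compare the two subnets for one node (requires both names to resolve):
#   for h in node01 "$(eth_name node01)"; do ping -c 3 "$h" | tail -n 1; done
```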
- 2008-05-14 12PM
MPI is working on all of the nodes. Use mpicc and mpirun to try it out. Root access is not necessary.
- 2008-04-18 5PM
Documented our sizes of the Partitions for the OS file trees of the head and the cluster nodes.
- 2008-04-17 7PM
Published a list which defines what is a working configuration of the new cluster.
- 2008-04-16 8PM
All nodes re-imaged with the apt-key knowledge of the public key of Aleksander: Key ID is B0ADA76C. The package manager no longer complains about the absence of trust in Alex's packages.
- 2008-04-16 7PM
A log of cluster's Temperature and fan speed changes.
- 2008-04-16 3PM
Wiki's start/stop script was written. See /etc/init.d/moinmoin on bioweb.
- 2008-04-16 2PM
Cluster Wiki Launched. This website will contain the documentation for the new cluster.
- 2008-04-15
Infiniband is down on node06. dmesg | grep Mellanox shows nothing. Re-imaging again did not help.
Node 6 has been fixed.
- 2008-04-15
All nodes re-imaged with a set of custom Debian packages allowing for a successful Infiniband ping-pong
- 2008-04-15
Our Custom-Debian-Repository is launched. It is hosted on the head node so that the cluster nodes can pull the packages directly without the need of apt-proxy. The proxy works well for the official packages, but we could not get it to work with our old custom repository. At this point there are 10 packages maintained by Aleksandr. All of them are closely related to Infiniband.
Cluster Network Topology
Vector Graphics UCR Cluster Network.svg