2018-06-30 §
16:35 <zhuyifei1999_> `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
16:34 <andrewbogott> "sed -i '/labstore1006/d' /etc/fstab" everywhere
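The two fstab edits above take different approaches: the first comments the labstore1006 mount out (reversible), the second deletes any matching line outright. A minimal local reproduction on a throwaway file (the fstab entry shown is illustrative, not the real one):

```shell
# Build a throwaway fstab containing an NFS dumps mount (illustrative entry).
cat > /tmp/fstab.test <<'EOF'
proc /proc proc defaults 0 0
labstore1006.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1006.wikimedia.org nfs ro 0 0
EOF

# Variant 1: comment the entry out, so it can be re-enabled later.
sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /tmp/fstab.test
grep '^#labstore1006' /tmp/fstab.test

# Variant 2: delete every line mentioning labstore1006.
sed -i '/labstore1006/d' /tmp/fstab.test
wc -l < /tmp/fstab.test
```

Note that in the first `sed` pattern the unescaped dots match any character; that is harmless here but `labstore1006\.wikimedia\.org` would be stricter.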
2018-06-29 §
17:41 <bd808> Rescheduling continuous jobs away from tools-exec-1408 where load is high
17:11 <bd808> Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
16:46 <bd808> Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070
2018-06-28 §
19:50 <chasemp> tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
18:02 <chasemp> tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
17:53 <chasemp> tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
17:20 <chasemp> tools-worker-1007:~# /sbin/reboot
16:48 <arturo> rebooting tools-docker-registry-01
16:42 <andrewbogott> rebooting tools-worker- to get NFS unstuck
16:40 <andrewbogott> rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
2018-06-21 §
13:18 <chasemp> tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-20 §
15:09 <bd808> Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool
2018-06-14 §
14:20 <chasemp> timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-11 §
10:11 <arturo> T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`
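The per-node logic of the clush one-liner above: only when the exim paniclog has more than zero lines, remove it and restart the exporter; the trailing `|| true` keeps clush from reporting failure on nodes where the file is empty or absent. A local dry run against a stand-in file (the path is a stand-in and the service restart is replaced with an echo):

```shell
PANICLOG=/tmp/paniclog.test          # stand-in for /var/log/exim4/paniclog
printf 'frozen message\n' > "$PANICLOG"

# Same shape as the one-liner: act only when the file has a non-zero line count.
wc -l "$PANICLOG" 2>/dev/null | grep -v ^0 \
  && rm -f "$PANICLOG" \
  && echo "would restart prometheus-node-exporter" \
  || true

[ -e "$PANICLOG" ] || echo "paniclog removed"
```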
2018-06-08 §
07:46 <arturo> T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
2018-06-07 §
11:01 <arturo> T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
2018-06-06 §
22:00 <bd808> Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
21:10 <bd808> Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
20:25 <bd808> Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
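The selection step for the restart scripting above might look like the following; the `kubectl get pods` output here is canned sample data (the live cluster is not available), and the real loop would restart each affected tool's webservice rather than just print its name:

```shell
# Canned 'kubectl get pods --all-namespaces' output (sample data, not live).
pods='NAMESPACE   NAME                       READY  STATUS            RESTARTS
foo         foo-1234567890-abcde       0/1    CrashLoopBackOff  12
bar         bar-1234567890-fghij       1/1    Running           0
baz         baz-1234567890-klmno       0/1    CrashLoopBackOff  7'

# Pick out the namespace (tool) of every pod stuck in CrashLoopBackOff;
# a real script would then run `webservice restart` as each listed tool.
echo "$pods" | awk '$4 == "CrashLoopBackOff" {print $1}'
```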
19:04 <chasemp> tools-bastion-03 is virtually unusable
09:49 <arturo> T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
2018-06-05 §
18:02 <bd808> Forced puppet run on tools-bastion-03 to re-enable logins by debenben (T196486)
17:39 <arturo> T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
17:38 <bd808> Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)
2018-06-04 §
10:28 <arturo> T196006 installing sqlite3 package in exec nodes
2018-06-03 §
10:19 <zhuyifei1999_> Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs with names starting with 'comm_delin' or 'delfilexcl' T195834
2018-05-31 §
11:31 <zhuyifei1999_> building & pushing python/web docker image T174769
11:13 <zhuyifei1999_> force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
2018-05-30 §
10:52 <zhuyifei1999_> undid both changes to tools-bastion-05
10:50 <zhuyifei1999_> also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
10:45 <zhuyifei1999_> installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834
2018-05-28 §
12:09 <arturo> T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
12:06 <arturo> T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia
2018-05-25 §
05:31 <zhuyifei1999_> Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
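`sge_request` holds default submit options applied to every grid job; after the edit above the relevant defaults would read roughly as below (the surrounding file content is assumed, only these two flags are from the log):

```
# /data/project/.system/gridengine/default/common/sge_request
# Default request flags applied to every submitted job
# (h_vmem raised from 256M; default release moved from precise to trusty):
-l h_vmem=512M
-l release=trusty
```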
2018-05-22 §
11:53 <arturo> running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for T194665 (mono framework update)
2018-05-18 §
16:36 <bd808> Restarted bigbrother on tools-services-02
2018-05-16 §
21:01 <zhuyifei1999_> maintain-kubeusers is stuck in an infinite loop of 10-second sleeps
2018-05-15 §
04:28 <andrewbogott> depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
04:07 <zhuyifei1999_> Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
04:05 <zhuyifei1999_> Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding
2018-05-12 §
10:09 <Hauskatze> tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343
2018-05-11 §
14:34 <andrewbogott> repooling labvirt1001 tools instances
13:59 <andrewbogott> depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
2018-05-10 §
18:55 <andrewbogott> depooling, rebooting, repooling tools-exec-1401 to test a kernel update
2018-05-09 §
21:11 <Reedy> Added Tim Starling as member/admin
2018-05-07 §
21:02 <zhuyifei1999_> re-building all docker images T190893
20:48 <zhuyifei1999_> building, signing, and publishing toollabs-webservice 0.39 T190893
00:25 <zhuyifei1999_> `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
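The renice above lowers the tar process's CPU scheduling priority so interactive bastion users get the CPU first. A self-contained sketch of the same move, using a background `sleep` as a stand-in for the tar process (raising your own process's niceness needs no privileges):

```shell
# Start a long-running background job standing in for the tar process.
sleep 60 &
pid=$!

# Drop its priority; nice value 15 strongly deprioritizes it.
renice -n 15 -p "$pid"

# Confirm the new nice value, then clean up.
ps -o ni= -p "$pid"
kill "$pid"
```

Worth noting: `renice` only affects CPU scheduling; for a job that is hogging disk or NFS I/O, `ionice` is the more direct tool, though deprioritizing CPU often slows the I/O producer too.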