2018-06-30 §
18:15<chicocvenancio>pushed new config to PAWS to fix dumps nfs mountpoint
16:40<zhuyifei1999_>because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
16:39<zhuyifei1999_>reboot tools-paws-master-01
16:35<zhuyifei1999_>`root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
16:34<andrewbogott>"sed -i '/labstore1006/d' /etc/fstab" everywhere
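A minimal sketch, assuming the clush `@all` group used elsewhere in this log, of how one might confirm no processes remain stuck in uninterruptible sleep after the fstab cleanup (this check is not recorded in the log itself):

```bash
# Hypothetical check: list processes in D (uninterruptible) state,
# which is what drove loadavg to ~1000 on tools-paws-master-01.
clush -w @all "ps -eo pid,stat,wchan:20,cmd | awk '\$2 ~ /^D/'"
```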
2018-06-29 §
17:41<bd808>Rescheduling continuous jobs away from tools-exec-1408 where load is high
17:11<bd808>Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
16:46<bd808>Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070
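A hedged sketch of the usual gridengine moves for rescheduling continuous jobs off a loaded host (host and job IDs are illustrative, not taken from the log):

```bash
# Hypothetical: see what is running on the overloaded host,
# then ask gridengine to reschedule a continuous job elsewhere.
qhost -j -h tools-exec-1408    # list jobs currently on the host
qmod -rj <job_id>              # reschedule one continuous job
```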
2018-06-28 §
19:50<chasemp>tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
18:02<chasemp>tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
17:53<chasemp>tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
17:20<chasemp>tools-worker-1007:~# /sbin/reboot
16:48<arturo>rebooting tools-docker-registry-01
16:42<andrewbogott>rebooting tools-worker- to get NFS unstuck
16:40<andrewbogott>rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
2018-06-21 §
13:18<chasemp>tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-20 §
15:09<bd808>Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool
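A sketch, under the assumption that orphans show up re-parented to init, of how the tool-owned stragglers might be located before killing them (the actual cleanup command is not recorded here):

```bash
# Hypothetical: find processes with PPID 1 owned by tool accounts,
# e.g. php-cgi left behind after an OOM-killed parent (T182070).
ps -eo pid,ppid,user:20,cmd --no-headers | awk '$2 == 1 && $3 ~ /^tools\./'
```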
2018-06-14 §
14:20<chasemp>timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-11 §
10:11<arturo>T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`
2018-06-08 §
07:46<arturo>T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
2018-06-07 §
11:01<arturo>T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
2018-06-06 §
22:00<bd808>Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
21:10<bd808>Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
20:25<bd808>Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
19:04<chasemp>tools-bastion-03 is virtually unusable
09:49<arturo>T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
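A minimal sketch, not the actual script from P7220, of how such a scripted webservice restart might look, assuming Kubernetes namespaces map one-to-one to tool names:

```bash
# Hypothetical restart loop for tools whose pods are in CrashLoopBackOff.
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1}' | sort -u \
  | while read -r ns; do
      # assumes namespace == tool name; the real mapping may differ
      sudo -i -u "tools.${ns}" webservice restart
    done
```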
2018-06-05 §
18:02<bd808>Forced puppet run on tools-bastion-03 to re-enable logins by dubenben (T196486)
17:39<arturo>T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
17:38<bd808>Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)
2018-06-04 §
10:28<arturo>T196006 installing sqlite3 package in exec nodes
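The roll-out itself is not shown; a sketch under the assumption that a clush node group covers the exec hosts (normally this would be puppetized rather than run by hand):

```bash
# Hypothetical one-off install across the grid exec nodes.
clush -w @exec 'sudo DEBIAN_FRONTEND=noninteractive apt-get install -y sqlite3'
```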
2018-06-03 §
10:19<zhuyifei1999_>Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' (T195834)
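A sketch of the gridengine commands involved; the job-name patterns come from the log entry, but the exact invocation is an assumption:

```bash
# Delete everything owned by tools.dibot except its lighttpd webservice.
qstat -u tools.dibot | awk 'NR > 2 && $3 != "lighttpd" {print $1}' | xargs -r qdel
# Delete the tools.mbh jobs whose names start with comm_delin or delfilexcl.
qstat -u tools.mbh | awk 'NR > 2 && ($3 ~ /^comm_delin/ || $3 ~ /^delfilexcl/) {print $1}' | xargs -r qdel
```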
2018-05-31 §
11:31<zhuyifei1999_>building & pushing python/web docker image T174769
11:13<zhuyifei1999_>force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
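A heavily hedged sketch of the image rebuild; the real workflow goes through the Toolforge docker-images tooling, so the image name below is illustrative only:

```bash
# Illustrative only: rebuild the python/web image and push it to the internal registry.
docker build -t docker-registry.tools.wmflabs.org/toollabs-python-web:latest .
docker push docker-registry.tools.wmflabs.org/toollabs-python-web:latest
```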
2018-05-30 §
10:52<zhuyifei1999_>undid both changes to tools-bastion-05
10:50<zhuyifei1999_>also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
10:45<zhuyifei1999_>installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834
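For reference, a sketch of the two temporary changes on tools-bastion-05 and their reversal (the exact commands are not in the log):

```bash
# Allow broader ptrace access temporarily, and install mono debug symbols.
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
sudo apt-get install -y mono-runtime-dbg

# Undo both once debugging is done (as recorded at 10:52).
echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope
sudo apt-get remove -y mono-runtime-dbg
```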
2018-05-28 §
12:09<arturo>T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
12:06<arturo>T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia
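Imports into apt.wikimedia.org are normally done with reprepro on the apt host; a hedged sketch, where the component and package file names are assumptions:

```bash
# Hypothetical reprepro imports of the mono packages into each distribution.
sudo -i reprepro -C thirdparty includedeb trusty-wikimedia mono-runtime_*.deb
sudo -i reprepro -C thirdparty includedeb jessie-wikimedia mono-runtime_*.deb
sudo -i reprepro -C thirdparty includedeb stretch-wikimedia mono-runtime_*.deb
```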
2018-05-25 §
05:31<zhuyifei1999_>Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
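sge_request holds default qsub flags applied to every submitted job; after the edit described above the relevant defaults would look roughly like this (exact file layout is an assumption):

```bash
# sge_request: default qsub options, one set per line (illustrative layout).
-l h_vmem=512M
-l release=trusty
```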
2018-05-22 §
11:53<arturo>running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for T194665 (mono framework update)
2018-05-18 §
16:36<bd808>Restarted bigbrother on tools-services-02
2018-05-16 §
21:01<zhuyifei1999_>maintain-kubeusers is stuck in infinite 10-second sleeps
2018-05-15 §
04:28<andrewbogott>depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
04:07<zhuyifei1999_>Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
04:05<zhuyifei1999_>Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding
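The drain procedure referenced above amounts to disabling the host's queue instances and clearing whatever will not move on its own; a sketch (only the `qdel -f 5221417` is confirmed by the log):

```bash
# Disable all queue instances on the unresponsive host so no new jobs land there.
qmod -d '*@tools-exec-1414'
# Force-delete the job the dead host will never report back on.
qdel -f 5221417
```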
2018-05-12 §
10:09<Hauskatze>tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343
2018-05-11 §
14:34<andrewbogott>repooling labvirt1001 tools instances
13:59<andrewbogott>depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
2018-05-10 §
18:55<andrewbogott>depooling, rebooting, repooling tools-exec-1401 to test a kernel update
2018-05-09 §
21:11<Reedy>Added Tim Starling as member/admin