21:00:36 <TimStarling> #startmeeting RFC meeting
21:00:36 <wm-labs-meetbot`> Meeting started Wed Sep 16 21:00:36 2015 UTC and is due to finish in 60 minutes.  The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:36 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:36 <wm-labs-meetbot`> The meeting name has been set to 'rfc_meeting'
21:01:03 <TimStarling> #topic Increase the strictness of mediawiki SQL code and leverage database code blockers for scalability  | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:01:29 <robla> o/ jynus
21:01:46 <jynus> hi there
21:01:58 <legoktm> o/
21:02:03 <ori> hello
21:02:36 <robla> jynus: thanks for putting together a great writeup so quickly!  I hope this didn't catch you (or anyone else) off guard
21:02:55 <TimStarling> #link https://phabricator.wikimedia.org/T112637
21:03:35 <Krenair> hey
21:03:42 <Scott_World_Univ> Hello everyone
21:04:33 <ori> so, some items to discuss
21:04:49 <ori> 1) the migration period -- one year seems excessively long to me; I think we can (and should) be more aggressive.
21:05:36 <matt_flaschen> Agree, 3-6 months should be enough.
21:05:36 <domas> the GROUP BY case was already handled for PG
21:05:44 <domas> lke, 10 years ago
21:05:45 <MaxSem> agree, "till it shines in CI/vagrant" is enough
21:06:07 <ori> 2) how do we identify existing violations? can we start enforcing these constraints in beta?
21:06:09 <jynus> domas, it happened again
21:06:17 <MaxSem> domas, I <3 mysql's group by so muuuuch:P
21:06:25 <MaxSem> PG is pain in the ass
21:06:29 <domas> inorite
21:06:34 <ori> (PG?)
21:06:39 <legoktm> postgres
21:06:46 <ori> oh
21:06:50 <jynus> core is in good shape
21:06:59 <robla> vacuum for the win  :-)
21:06:59 <jynus> some wmf-only extensions are not
21:07:02 <TimStarling> so the proposal is to make MediaWiki compatible with the TRADITIONAL mode
21:07:09 <domas> having mediawiki in strict'able form is good
21:07:23 <ori> jynus: which?
21:07:34 <jynus> there are examples on the RFC
21:07:48 <jynus> last one I can recall is centralauth
21:08:16 <Krenair> We fixed CentralAuth didn't we?
21:08:19 <ori> anomie -- are you aware of this, and do you have a sense of how much work it would be to fix?
21:08:25 <Krenair> At least, the Special:GlobalUsers one
21:08:36 <jynus> yes, want everyone's compromise to keep doing it :-)
21:08:48 <TimStarling> I'm not sure I entirely buy the arguments for not setting the mode in the session
21:09:05 <anomie> ori: What am I being aware of?
21:09:36 <ori> CentralAuth being potentially incompatible with TRADITIONAL mode
21:09:41 <TimStarling> we already have $wgSQLMode
21:09:53 <legoktm> it would just be adding a primary key right?
21:10:07 <Krenair> Where are these examples of extensions violating it?
21:10:22 <jynus> I do not feel strongly about that part, TimStarling
21:10:51 <TimStarling> $wgSQLMode is empty by default, which means by default, sql_mode is set to an empty string on connect
21:10:53 <jynus> but setting it on session
21:11:06 <jynus> means that extensions can choose to disable it?
21:11:12 <anomie> ori: I don't know anything about CentralAuth being or not being compatible.
21:11:14 <domas> TimStarling: I used to remove that :)
21:11:30 <jynus> plus 1 extra roundtrip on every connection?
21:11:32 <ori> <Krenair> Where are these examples of extensions violating it? <-- I can't find them either.
21:11:38 <TimStarling> no, extensions shouldn't be setting that kind of configuration
21:11:47 <TimStarling> no, it's not an extra roundtrip
21:12:02 <TimStarling> there is a single SET statement done on connect, unconditionally
21:12:04 <legoktm> there's a list of tables at https://phabricator.wikimedia.org/T17441#1420166
21:12:10 <jynus> ori, Krenair I gave links of examples of issues found in the past
21:12:18 <legoktm> but that only includes those that are on enwiki
21:12:18 <TimStarling> variables to set are concatenated, but there is always at least one at present
21:12:25 <TimStarling> $set = array( 'group_concat_max_len = 262144' );
21:12:30 <TimStarling> if ( is_string( $wgSQLMode ) ) {
21:12:36 <TimStarling> $set[] = 'sql_mode = ' . $this->addQuotes( $wgSQLMode );
21:12:36 <jynus> I did not create a report of extensions incompatible
21:12:41 <Krenair> okay, have we identified any remaining ones?
21:13:13 <jynus> Krenair, it is not easy. sql mode is not something that limits static code, but dynamic queries
21:13:25 <jynus> hence the time on beta to check it
21:13:32 <Krenair> It's fine if you haven't of course
21:13:35 <ori> #action Core is already in good shape; some extensions are not. We should create a report of incompatible extensions.
21:13:44 <ori> yeah, I think you can assign that to developers
21:14:13 <TimStarling> my inclination would be to have an active migration period, where we review extensions and fix them
21:14:21 <jynus> re: primary keys, there is a list on the linked ticket
21:14:26 <TimStarling> followed by setting the default $wgSQLMode to TRADITIONAL
21:14:45 <TimStarling> but this can and probably should be overridden in production to not set it, relying on the server configuration instead
21:15:01 <jynus> my priority is to block new features from adding more issues
21:15:02 <TimStarling> that way, developers will all be running TRADITIONAL
21:15:03 <domas> if RPC is there, it doesn't cost much to have it there
21:15:33 <ori> #info TimStarling in favor of setting the mode in the session. $wgSQLMode is empty by default, which means by default, sql_mode is set to an empty string on connect. We should have an active migration period, where we review extensions and fix them, followed by setting the default $wgSQLMode to TRADITIONAL.
21:15:52 <TimStarling> if someone needs to run an old unmigrated extension they can always set $wgSQLMode = '' in their LocalSettings.php
21:16:12 <Krenair> what about settings in beta?
21:16:24 <ori> and vagrant
21:16:27 <robla> TimStarling: re: active migration period....I think that's a good thing, but we do need to figure out how to get this in WMF dev team's respective roadmaps
21:16:32 <Krenair> Heh. And vagrant...
21:16:37 <domas> #info I have no idea what this means.
21:16:44 <domas> /o\
21:17:25 <TimStarling> do we use mysql in jenkins now? or is it still sqlite?
21:17:37 <robla> domas: notetaking trolling!  awesome...I leave it to you to innovate in the trolling dept  :-)
21:17:47 <ori> Krinkle: ^
21:18:00 <Krinkle> TimStarling: mysql
21:18:05 <TimStarling> domas: http://meetbot.debian.net/Manual.html#commands
21:18:20 <TimStarling> so we can at least set it to TRADITIONAL in jenkins
21:18:32 <ori> #action Update MediaWiki-Vagrant, Beta Cluster and Jenkins to run with $wgSQLMode = 'TRADITIONAL';
21:18:49 <domas> by the way, how does this make things more scalable?
21:18:51 <domas> ;-)
21:19:01 <domas> I hear "scalability' is the reason
21:19:37 <jynus> domas: I saw a meme generator, and added scalability and leverage, which souded cool
21:19:37 <Krinkle> Jenkins runs all tests on mysql now, and it's temporary scoped TMPDIR, as well as mysql data mounts are all in tmpfs RAM
21:19:50 <Krinkle> (for performance reasons)
21:19:51 <ori> jynus: is 'TRADITIONAL' enough, or would 'ONLY_FULL_GROUP_BY' be needed as well?
21:19:56 <jynus> both
21:20:19 <ori> #action Make that $wgSQLMode = 'TRADITIONAL, ONLY_FULL_GROUP_BY';
21:21:06 <jynus> can we clarify when we would allow to relax that?
21:21:20 <ori> jynus: re: <jynus> my priority is to block new features from adding more issues -- I should think that having the RfC approved here, and then following up with a quick note to engineering saying violators will be reverted should do the trick.
21:21:34 <ori> or wikitech-l
21:21:40 <Krenair> Wikitech-l, please.
21:22:03 <ori> jynus: do you think we need to allow exemptions?
21:22:12 <jynus> no, I would prefer not
21:22:21 <jynus> that is why I am asking
21:22:24 <ori> so let's not allow any
21:23:49 <ori> we have a hard time finding good DBAs because the cognitive load of maintaining our databases is enormous.
21:23:49 <jynus> so, to summarize
21:23:58 <Krenair> So... What is still left to deal with?
21:24:01 <Krinkle> Is there a way to give these options to sql mode in a warning-only away?
21:24:13 <jynus> PK on all tables (as soon as possible)
21:24:13 <Krinkle> I'd like to avoid having to run the whole test suite twice just to vary on sql option
21:24:23 <jynus> SQL strict
21:24:29 <Krinkle> (since we want it non-voting at first)
21:24:39 <jynus> there are a couple of other small issues
21:24:49 <jynus> and I used should in RFC meaning
21:24:51 <Krinkle> we can have a non-voting capture of the warning stream, that'd be more scalable for Ci
21:25:08 <jynus> trying to avoid unsecure/undeterministic statements
21:25:13 <ori> <jynus> PK on all tables (as soon as possible) <-- what do you need from developers to make that happen?
21:25:45 <jynus> well, I need to be able to ban new features to create tables without pks
21:25:52 <jynus> I did not feel backed about that
21:25:55 <TimStarling> should have been done years ago
21:26:10 <jynus> I can handle the transition myself, with time
21:26:16 <TimStarling> is someone arguing against it?
21:26:46 <MaxSem> isn't DBA's approval already required for schema changes?
21:26:47 <jynus> let's say there are some developers pushing strongly schema changes I do not 100% agree with
21:26:51 <ori> #info New tables without PKs will not be tolerated from now on.
21:27:15 <MaxSem> jynus, you -2 it and go on ;)
21:27:22 <Krinkle> Indeed.
21:27:32 * robla agrees with MaxSem
21:27:40 <jynus> that is exactly why I needed this :-)
21:27:59 <jynus> so, back to the pending topics
21:28:08 <Krinkle> jynus: I imagine some of this affects strictness on queries as well, not just schemas. Which are less-strongly reviewed, but should also ideally be enforced socially for new code.
21:28:20 <jynus> recommend against undeterministic queries
21:28:25 <Krenair> jynus, are you willing to point out any particular changes you think are harmful?
21:28:38 <jynus> because they are causing data corruption in labs
21:29:24 <MaxSem> because I'm sure many people don't understand "nondeterministic queries", this needs to be documented well on mw.org
21:29:28 <Krinkle> jynus: Hm.. replication copies the query, not the impact? I thought binlogs weren't in sql. Or is it different for slave repl and labs repl. (I'll defer outside the meeting, personal curiousity)
21:29:48 <jynus> yes, hence the ticket to change to row-based replication
21:30:09 <jynus> but until that, try to minimize the impact with current production configuration
21:30:38 <jynus> I say here "should" and not "must", because this is the hardest one to detect and fix
21:31:01 <jynus> but I know that it happens because the slave drift on the labs slaves
21:31:05 <TimStarling> nondeterministic write queries?
21:31:26 <Krinkle> (outlined at https://phabricator.wikimedia.org/T112637)
21:31:36 <jynus> something like "UPDATE... limit without order"
21:31:41 <jynus> we do not do that
21:31:49 <jynus> but there are other things that we do
21:32:03 <jynus> like insert ... select with auto_increment fields
21:32:09 <TimStarling> well, this is a "must" recommendation for code review
21:32:11 <jynus> resulting on different ids on some slaves
21:32:17 <TimStarling> you are only worried about accidentally letting one through, right?
21:32:24 <jynus> they are very subtle to detect
21:32:29 <TimStarling> anyone who sees such a thing should give -2
21:32:44 <TimStarling> well, -1, I guess, if it is correctable
21:32:58 <jynus> yes, sure, but the only real solution is row based replication
21:33:11 <jynus> until I can do that, try to minimize impact
21:33:26 <jynus> I can create a list of offenders to check
21:33:43 <jynus> and some if them are in core
21:34:14 <jynus> I think that is all
21:34:43 <Krenair> We should be able to make core warn/throw an exception on those, assuming they're not just completely raw SQL.
21:34:49 <ori> #info We need to be strict with queries, not just schemas. There are nondeterministic write queries running in production -- we know that this happens because now that it happens because we see slave drift on the labs slaves. They are not always easy to spot in code review.
21:34:54 <TimStarling> what is the procedure in gerrit btw? do you want to be added as a reviewer on all database-related changes?
21:34:58 <jynus> actually, sql thouws a warning for those
21:35:01 <Krenair> Assuming it currently supports that. . .
21:35:03 <Krenair> ok
21:35:08 <jynus> we can detect them dynamicly
21:35:25 <ori> #action TimStarling: Anyone who sees such a thing [i.e., nondeterministic write queries] should give -1 / -2 (depending on whether the issue is correctable).
21:35:39 <TimStarling> that's not an action
21:35:42 <jynus> the throw "[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT"
21:35:51 <ori> TimStarling: yeah, my bad.
21:36:09 <Krenair> do we get these warnings and the bad query sent to the logs on fluorine?
21:36:12 <jynus> but I think warnings may not be logged, and there may be too many of them?
21:36:16 <spagewmf> did anyone respond to Krinkle's "Is there a way to give these options to sql mode in a warning-only way?"
21:36:19 * robla wants to know the answer to TimStarling's question about gerrit procedure
21:36:51 <jynus> in general, thigs like truncations
21:36:56 <jynus> produce warnings
21:36:58 <jynus> too
21:37:05 <jynus> only errors when in strict mode
21:37:14 <jynus> so that is a way to check those
21:37:27 <TimStarling> ok so someone needs to volunteer to figure out where the logs go and where they should go and how they can be routinely reviewed
21:37:33 <Krinkle> jynus: Ok. so we'll need a way for jenkins to extract those warnings from wherever they're sent. We can add them to the assertion that already checks mediawiki exception and warning logs to be empty etc.
21:37:39 <ori> #link https://dev.mysql.com/doc/refman/5.0/en/sql-mode.html#sql-mode-strict
21:37:51 <legoktm> I think https://gerrit.wikimedia.org/r/#/c/198661/ is related? "[WIP] Optionally log warnings generated by MySQL"
21:38:01 <jynus> check the code on the RFC for "Query OK, 1 row affected, 4 warnings"
21:38:14 <Krinkle> Tricky bit is that Jenkins slaves share one local db instance. They have separate username/databases, but it's only one server, so no job-specific error log
21:38:23 <ori> legoktm: yes, looks like it
21:38:24 <Krinkle> But if we can extract them from mediawiki in resposne to a query, that'd be cool
21:38:40 <TimStarling> uploaded in march?
21:38:46 <ori> we suck
21:39:25 <ori> TimStarling: could you review it? :)
21:39:38 * ori itches for an #action.
21:39:52 <TimStarling> I added jynus and myself as reviewers
21:40:12 <ori> #action TimStarling and jynus to review https://gerrit.wikimedia.org/r/#/c/198661/ (Optionally log warnings generated by MySQL).
21:40:18 <ori> \o/
21:40:30 <jynus> be careful
21:40:33 <spagewmf> jynus: that's what confuses me. the first run in T112637 prints "Query OK, 1 row affected, 4 warnings (0.00 sec)", but no warnings appear. In addition to jenkins, we want a default developer configuration (in MediaWiki-vagrant and in the mw.org debugging page) that loudly prints these warnings
21:40:43 <jynus> warnings are disabled on mysql's error log
21:40:58 <jynus> because they may fill up the disks :-)
21:41:10 <jynus> in the client command line
21:41:11 <TimStarling> jynus: ok, so the change proposed should be what you want then
21:41:17 <Krenair> Do we generate a huge number of warnings?
21:41:27 <TimStarling> I was going to ask whether we should do it via the server but that sounds like a no
21:41:37 <TimStarling> the change proposed is to log via MediaWiki's debug log system
21:41:41 <jynus> something like enabling it for a server first, check volumne
21:41:56 <Krenair> so we don't actually know what the volume is at the moment?
21:42:07 <TimStarling> we can just enable it and see
21:42:09 <jynus> *size, fix huge issues, before enabling it globally
21:42:21 <legoktm> MW logging supports sampling if its really bad
21:42:22 <ori> jynus: MediaWiki can turn log channels on / off and apply a sampling factor to reduce the rate
21:42:35 <ori> so it's pretty safe to try it and see
21:42:38 <TimStarling> yeah we can manage noisy logs, we have plenty of experience with that
21:42:39 <jynus> yes, just I am thinking as an op here, being careful
21:42:42 <jynus> :-)
21:42:55 <spagewmf> MediaWiki-Vagrant's config is different than production, though it's reused for some labs machines
21:42:55 <jynus> spagewmf, command line client
21:43:08 <jynus> does not show warnings by default
21:43:18 <jynus> either you SHOW WARNINGS
21:43:28 <jynus> or change mysql command line client config
21:43:39 <TimStarling> I don't mind merging it, but I don't want it to be my job to push it through to production and check log volume and all that
21:43:45 <jynus> connectors tend to do the same, not returning warnings by default
21:43:48 <ori> TimStarling: I can do that part
21:43:56 <TimStarling> thanks ori
21:44:15 <ori> #action Once https://gerrit.wikimedia.org/r/#/c/198661/ is merged, Ori to deploy and monitor log volume.
21:44:30 <ori> does the beta cluster have a slave database server?
21:44:47 <Krenair> yes
21:45:40 <ori> <jynus> but I know that it happens because the slave drift on the labs slaves
21:45:45 <TimStarling> (btw the #action command should start with an IRC nickname parameter, meetbot organises action items by name)
21:45:49 <ori> do we have monitoring for that?)
21:45:53 <jynus> yes, ori, there is a ticket
21:46:09 <ori> ack
21:46:27 <jynus> https://phabricator.wikimedia.org/T111371
21:46:42 <jynus> labs slaves (sanitarium, actually)
21:46:51 <jynus> got heavely desynced
21:47:14 <TimStarling> ok, was there anything else to discuss?
21:47:21 <ori> #link https://phabricator.wikimedia.org/T111371
21:48:05 <TimStarling> we won't need a wrap-up period if we've covered everything already
21:48:15 <ori> I think it would be useful to pronounce this formally accepted, to put some muscle behind the enforcement
21:48:37 <jynus> yes, I would like to add the accepted bits the the mediawiki documentation
21:48:55 <spagewmf> jynus: hi, I am Technical Riter :)
21:49:06 <jynus> :-)
21:49:30 <ori> #action spagewmf to update MediaWiki documentation
21:50:32 <chasemp> jynus: really great writeup and effort on T112637 man, thank you
21:50:34 <spagewmf> ori: do you think that phab change is enough to make these warnings more visible on MediaWiki-Vagrant instances?
21:51:05 <ori> no, there are some additional tweaks can make to the configuration
21:51:11 <spagewmf> (maybe it's premature if the volume of warnings is overwhelming)
21:51:25 <ori> the volume wouldn't be overwhelming for vagrant
21:51:41 <ori> chasemp++
21:51:41 <TimStarling> maybe jynus wants to give the task description another edit before I mark it accepted?
21:51:54 <jynus> yes, I can do that
21:52:00 <Krenair> ori, deployment-db2 is the slave of deployment-db1 in beta, by the way
21:52:05 <spagewmf> ori: great, ping me on the changes and I'll put them in the mw.org debugging page for people not using MW-Vagrant
21:52:07 <ori> "Stop fucking with my databases. --jynus"
21:52:10 <jynus> basically it was changing the session thing
21:52:22 <TimStarling> yeah
21:52:31 <robla> I think it'd be ok to trust jynus to get this stuff done after its approved, fwiw
21:52:41 <spagewmf> ori Krenair : I think wmf-config/db-labs.php is the config
21:52:45 <jynus> 3-6 moths we said?
21:52:47 <Krenair> yes
21:53:25 <ori> let's say no more than 6, and leave the option of making the timeline narrower if the number of issues is not unmanageable
21:53:48 <TimStarling> will we need to schedule another of these meetings on this topic? if no, then I will move it to the approved column
21:53:58 <Krenair> (yes was to spagewmf)
21:54:16 <robla> I don't think we need to schedule another meeting on this unless jynus wants to clear some things up
21:54:34 <jynus> no, just help with the review
21:54:47 <TimStarling> ok
21:54:50 <ori> I think having a check-in meeting sometime in the future would be good, but not to debate the proposal, but evaluate progress. So not a blocker to approval.
21:55:18 <ori> (And perhaps these meetings aren't the right venue for check-ins, dunno.)
21:55:23 <spagewmf> I look forward to wmgDBJynusNoF___Mode = true in wmf-config
21:55:25 <robla> ori++
21:55:28 <TimStarling> #agreed RFC approved, assuming decisions from this meeting will be incorporated into the phabricator task
21:55:47 <jynus> I did a first edit
21:55:54 <jynus> 5 minutes to review it :-)
21:56:07 <jynus> I can add the log thingy, too
21:56:15 <ori> nice work jynus! I think this is the first RfC that was entirely done by a member of the TechOps team
21:56:25 <ori> so it's an important milestone
21:56:42 <robla> yup, excellent job jynus!
21:57:07 <spagewmf> are T-shirts still a thing
21:57:18 <ori> \o
21:58:01 <jynus> ha ha
21:58:07 <TimStarling> #endmeeting