@gwicke
Last active November 13, 2015 23:48
gwicke commented Nov 13, 2015

[11:19] Krinkle: you around?
[11:19] Yep
[11:19] hey, just responded on the patch, but might be quicker to discuss here
[11:19] MediaWiki-General-or-Unknown, Browser-Support-Internet-Explorer, Upstream: The XSS filter in IE8/9 breaks certain tools - https://phabricator.wikimedia.org/T34013#1804400 (Catrope)
[11:20] gwicke: Oh I see what you mean.
[11:20] Right, there is definitely a visibility aspect here
[11:20] how is the filtering between private & public events intended to work in RCFeed?
[11:20] I think we'll need to make that a switch based on configuration
[11:20] Not part of any class
[11:20] E.g. private=>true for the EventBus feed
[11:20] in some cases, it's just about including some extra info
[11:20] Yeah
[11:20] like for page deletes
[11:20] Let me see
[11:21] but, there are also changes that might just never make sense in RCStream
[11:21] user registrations for example
[11:22] you mean, account creations?
[11:22] yeah
[11:22] those are and should be included
[11:22] Why would they not make sense in RCStream?
[11:22] mostly because they might be sensitive
[11:23] Listen to Wikipedia uses them to make noise. Special:RecentChanges and Special:Log include them. Counter-vandalism bots use them to match against certain patterns to find repeated abusers and whatnot.
[11:23] (PS1) Mholloway: Refactor TriggerAbuseFilterTest [apps/android/wikipedia] - https://gerrit.wikimedia.org/r/252970 (https://phabricator.wikimedia.org/T115903)
[11:23] If they are sensitive they should not be in RC.
[11:23] If they are not in RC they will not be in RCFeed.
[11:23] do they include ips?
[11:23] user blocks?
[11:23] Improving access: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1804411 (leila) NEW a:Ottomata
[11:23] this is a good list of things analytics would like to have eventually: https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource#Relevant_events
[11:23] User block events are public.
[11:23] The IP that created an account is not public.
[11:24] wgLogRestrictions is relevant
[11:24] That controls whether a log event will go to logging table only, or go to recentchanges + feed as well.
[11:24] If it is restricted, it never makes it into the rc table. So it's not hidden or anything like that, it's just never there to begin with.
[11:25] okay, that sounds pretty good
[11:25] Regardless of user rights.
[11:25] (PS45) Ottomata: EventLogging processor as a service via HTTP [extensions/EventLogging] - https://gerrit.wikimedia.org/r/235671 (https://phabricator.wikimedia.org/T114443)
[11:25] well, we really need revision suppression info
[11:25] However I think you want those events in Restbase right?
[11:25] Right
[11:25] I think we can implement visibility of rows, similar to what we did with the revision table.
[11:26] so by default everything that goes to RCFeed goes to the logging table?
[11:26] And that would cascade through various consumers. E.g. we can choose to make them visible to sysops on-wiki on Special:recentChanges, we can choose to make them visible in certain RCFeed configurations (e.g. the one for EventBus). And we can even do it in two layers, e.g. if we later re-implement RCStream on top of event bus, we would filter them out at the
[11:26] edge.
[11:27] MediaWiki-API, Technical-Debt: Reduce the usage of API format=php - https://phabricator.wikimedia.org/T118538#1804425 (Yurik) >>! In T118538#1803847, @Luke081515 wrote: >>>! In T118538#1803393, @yurik wrote: >> @Luke081515, what encoding problems do you have with JSON? > > For example, when I get letters...
[11:27] We can also choose to not make them visible in the UI regardless of user rights to keep it safe and as-is.
[11:27] users don't expect it there right now.
[11:27] same for the PHP API
[11:27] gwicke: Anything other than suppressionlog?
[11:28] is all page deletion information public?
[11:28] The typical revision delete concerns don't apply to RC because it's all after the fact.
[11:28] IIRC there was something about the title being hidden
[11:28] So that means, inevitably, if you poll recent changes you will have information that can later become invisible.
[11:28] (CR) jenkins-bot: [V: -1] EventLogging processor as a service via HTTP [extensions/EventLogging] - https://gerrit.wikimedia.org/r/235671 (https://phabricator.wikimedia.org/T114443) (owner: Ottomata)
[11:28] Hm..
[11:29] * Krinkle caught himself trying to go to Special:Delete/Sandbox
[11:29] If only!
[11:29] hehe
[11:29] Delete form has no suppression option
[11:30] (CR) Qdinar: "$variantfallbacks = array(" [core] - https://gerrit.wikimedia.org/r/164049 (https://phabricator.wikimedia.org/T27537) (owner: Qdinar)
[11:31] Krinkle: if we can figure out a safe way to filter those logs into private & public that doesn't rely on each formatter to handle each event individually, then the RCFeed thing could work well
[11:31] it'd basically be some kind of internal subscription / sanitization layer before handing it to the formatters
[11:31] MediaWiki-File-management, Commons, Multimedia: Image redirects from the shared repo don't show "redirected from" - https://phabricator.wikimedia.org/T16117#1804438 (Catrope)
[11:32] I haven't really investigated how much filtering capability there is already
[11:32] gwicke: Yeah, I think we'd end up with something like $wgRCFeeds['...'] => array( formatter => ..class.., includeRestricted => true, other config passed to constructor )
[11:32] and then the base logic would provide it.
[11:33] it shouldn't be dealt with in the individual engine and formatter I think
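
A minimal sketch of the idea in the last few messages: visibility is decided once in shared base logic, driven by per-feed configuration, so individual formatters never see restricted events. It is written in Python for illustration only and is not MediaWiki code. The `Feed` class, the `include_restricted` flag (mirroring the proposed `includeRestricted` key) and the `restricted` event property are all hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feed:
    """One configured feed; loosely mirrors an entry in a feed config array."""
    name: str
    formatter: Callable[[dict], str]
    include_restricted: bool = False  # hypothetical flag, off by default
    sent: List[str] = field(default_factory=list)


def dispatch(event: dict, feeds: List[Feed]) -> None:
    """Base logic: filter on visibility before any formatter runs."""
    for feed in feeds:
        if event.get("restricted", False) and not feed.include_restricted:
            continue  # restricted events never reach this feed's formatter
        feed.sent.append(feed.formatter(event))


feeds = [
    Feed("rcstream", json.dumps),                           # public feed
    Feed("eventbus", json.dumps, include_restricted=True),  # internal feed
]
dispatch({"type": "log", "action": "delete", "title": "Sandbox"}, feeds)
dispatch({"type": "log", "action": "suppress", "title": "Sandbox",
          "restricted": True}, feeds)
```

With this shape a new formatter cannot leak restricted events by accident, because it is never handed one unless its feed opted in.
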
[11:33] OK. So it's more complicated than I thought
[11:33] Suppress is different by default in MediaWiki
[11:33] MediaWiki-extensions-WikibaseRepository, Wikidata, Patch-For-Review, Story, Wikidata-Sprint-2015-11-03: [Story] Add icons to text in toolbars - https://phabricator.wikimedia.org/T87757#1804442 (Jonas)
[11:34] Krinkle: I guess another option would be to use RCFeed only for the public stream
[11:34] and feed it from something data / filter focused
[11:34] MediaWiki-Interface: [collapsibleTabs] Tabs (Read / View Source / Search) wrap to next line and cover content if screen width < ~700px - https://phabricator.wikimedia.org/T56919#1804448 (Catrope)
[11:34] MediaWiki-Database, MediaWiki-Page-editing, Performance-Team: Links tables are sometimes not being populated - https://phabricator.wikimedia.org/T117332#1804449 (aaron) a:aaron
[11:35] that's just about how we wrap the functionality into objects, though
[11:35] gwicke: So delete does not have any suppression options. However, after the delete, users with the available rights can hide entries from Special:Log. (which is a generic feature)
[11:35] MediaWiki-extensions-ConfirmEdit-(CAPTCHA-extension), Flow, Collaboration-Team-Current: Requests to the API should return CAPTCHA in the UI language - https://phabricator.wikimedia.org/T117112#1804453 (Catrope) Open>Resolved
[11:35] MediaWiki-Special-pages, Patch-For-Review: Uncategorized categories/pages/templates/files refactoring - https://phabricator.wikimedia.org/T14942#1804456 (Catrope)
[11:35] That's mostly meant for indirect log actions though, not something like delete, since delete will still be visible on recent changes I think.
[11:35] Testing now
[11:36] (PS1) VolkerE: Add icon for 'View Source' menu item [skins/Blueprint] - https://gerrit.wikimedia.org/r/252973 (https://phabricator.wikimedia.org/T104213)
[11:36] Yeah, hiding a log/delete event with all options ticked will still keep the RC entry visible on Special:RecentChanges, even retroactively.
[11:36] Krinkle: IIRC the concern was that people encode sensitive / offensive information in titles, and we need a way to get rid of it
[11:36] even for anons
[11:36] it'll be invisible from Log though
[11:37] It seems important also so that consumers can maintain some kind of reasonable state.
[11:38] The RC entries for page creations and other revisions will be deleted from RC tables during page delete, so any one trying to catch up from the past via Special or API won't see it. But the page deletion event stays in RC
[11:38] That's the status quo.
[11:38] MediaWiki-skins-Blueprint, Patch-For-Review: No icon for "View source" in sidebar - https://phabricator.wikimedia.org/T104213#1410128 (Volker_E)
[11:39] Krinkle: if we wanted to filter that from an event stream with historical data we'd probably need to filter on the way out
[11:39] have a blacklist of titles, or something like that
[11:41] MediaWiki-extensions-WikibaseClient, Wikidata, Easy, Need-volunteer: Add a keyboard shortcut to "Add links" from Wikipedias - https://phabricator.wikimedia.org/T74808#1804480 (darthbhyrava) Okay, I'll go ahead with , and get working on the patch, then!
[11:43] <James_F> multichill: Hey!
[11:44] Krinkle: I think we'll need to make emitting events a bit more general than https://github.com/wikimedia/mediawiki/blob/b71c537ef82f5dedd23a8dcc3f41a975f0c0e7b2/includes/changes/RecentChange.php#L355-L404
[11:44] James_F: We were talking about structured metadata on Commons in the context of https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey#Structured_metadata_for_Commons
[11:45] gwicke: I'm not sure about a blacklist, but I agree event emission should be more general than recent changes. RCStream is one topic of events. Currently the most mature one. But I would very much hope we not try and squish everything into that.
[11:45] nod
[11:45] better to layer

[11:54] Krinkle: ping
[11:54] Krinkle: do you have a sec to talk about https://gerrit.wikimedia.org/r/#/c/252950/2?
[11:55] I do
[11:55] gwicke:
[11:55] i think we (I) have some crossed wires, and maybe it means that doing this as an rcfeed is the wrong choice
[11:56] https://phabricator.wikimedia.org/T116786 is about writing events from MW to this new event service
[11:56] urandom: krinkle & I just chatted in -dev, see https://gist.github.com/gwicke/df8b347058a19e6556f6 for a log
[11:57] * urandom is reading
[12:03] to me, it seems that we could have a thin layer below RCFeed to handle events in general, and then dispatch those that should be included in feeds to RCFeed
[12:03] Not unlike log groups, however.
[12:03] Which uses Monolog now
[12:04] wgFeedGroups['rcfeed'][], perhaps similar to wgHooks?
[12:04] keyed by "topic"
[12:04] or recentchanges as topic rather, not rcfeed
[12:04] yeah, and each event could have several topics set
[12:04] and then another for other types of events.
[12:04] Several? That would get confusing
[12:04] With regards to formatting expectations
[12:05] sure; if we make the topics fine-grained then the feeds can list what they are interested in
[12:05] I was thinking about including a 'rcfeed' meta topic
[12:05] We should not hardcode seemingly arbitrary aggregations of topics inside mediawiki core. If we want that maybe we should have a kafka consumer on mediawiki_recentchanges, and have it feedback into Kafka for specific things and formats.
[12:05] yeah, agreed
[12:05] to a different topic.
[12:06] the other aspect is the private / public data split per event
[12:06] one way to handle that would be to emit two events right at the source
[12:06] Yeah. As long as it's all bound to recent changes I think we can do it on that side of the division.
[12:06] Makes sense
[12:07] OK. So yeah, we need them to be outside recentchanges. I realise why now.
[12:07] Because they would not have an rcid
[12:07] unless they are in the table
[12:07] and if they are in the table, we break the interface between consumers of the table and require filtering, which I'd like to avoid.
[12:08] yeah, unless all those consumers go through a single accessor it would be hard to enforce consistent filtering
[12:08] It would be interesting to have the SQL writer be an inphp consumer of the feed, but can't be because of rcid
[12:08] which is only known after writing
[12:08] on the other hand, having a table with all events could be nice as a low-fi eventbus backend
[12:08] Yeah
[12:08] Hm..
[12:09] we were talking about consistent event emission as well
[12:09] it's a lot easier to guarantee consistency if you can write to a table as part of a primary transaction
[12:09] Yeah, we can add restricted to the rc table. Index on it. Hidden by view. Hidden from feeds by default. Optionally made visible. For Redis, IRC etc. (which are currently unfiltered to the public) they don't change anything. For internal Kafka we can do the full feed including restricted.
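
A rough sketch of the "restricted column, indexed, hidden by a view" idea from the message above. It uses Python's `sqlite3` as a stand-in for the real database, and a heavily simplified table. The `rc_restricted` column, the `recentchanges_public` view and the `rows_for_feed` helper are assumptions made for illustration, not existing MediaWiki schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE recentchanges (
        rc_id         INTEGER PRIMARY KEY AUTOINCREMENT,
        rc_title      TEXT NOT NULL,
        rc_log_action TEXT,
        rc_restricted INTEGER NOT NULL DEFAULT 0  -- hypothetical column
    );
    CREATE INDEX rc_restricted_idx ON recentchanges (rc_restricted);
    -- Default consumers read the view, so restricted rows stay hidden.
    CREATE VIEW recentchanges_public AS
        SELECT rc_id, rc_title, rc_log_action
        FROM recentchanges
        WHERE rc_restricted = 0;
    """
)
conn.executemany(
    "INSERT INTO recentchanges (rc_title, rc_log_action, rc_restricted) "
    "VALUES (?, ?, ?)",
    [("Sandbox", "delete", 0), ("Sandbox", "revision", 1)],
)


def rows_for_feed(conn, include_restricted=False):
    """Public feeds read the view; an internal feed may read the full table."""
    source = "recentchanges" if include_restricted else "recentchanges_public"
    return conn.execute(
        f"SELECT rc_id, rc_title, rc_log_action FROM {source} ORDER BY rc_id"
    ).fetchall()


public_rows = rows_for_feed(conn)
internal_rows = rows_for_feed(conn, include_restricted=True)
```

Because the row is written in the same transaction as the change itself, every event still gets an rc_id, and only the opted-in internal feed sees the restricted row.
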
[12:10] But we may still need an idea of topics at some point.
[12:10] But for the initial use case it seems like rc covers it
[12:10] a lot of other things currently use logstash and statsd instead.
[12:11] yeah, which makes sense if you don't need 100% reliability
[12:11] It does mean we can only emit events from POST requests.
[12:12] I mean, with the goal to not connect to master on GET.
[12:12] which is already the case, but something to keep in mind
[12:12] that's fine for the events we have in mind right now
[12:13] The added rc table field should be simple, but not trivial.
[12:13] now, if we used this as a backend for eventbus in small installs, would it be fine to write all kinds of events in there?
[12:14] Hm..
[12:14] like RESTBase signaling that HTML for some revision was rerendered?
[12:14] Not sure I follow
[12:14] eventbus is a general event system
[12:14] Right
[12:14] it'll include events from different services, including MW, RB etc
[12:15] would small installs have RB?
[12:15] a lot of the functionality will be job queue like
[12:15] yes, I am operating under the assumption that small installs will have Parsoid, RB and VE
[12:15] and some form of EventBus
[12:16] Right and we want to stop MediaWiki from being directly aware of RB (with the RB extension)
[12:16] yeah
[12:16] Do we define EventBus as a protocol, or as a service?
[12:17] RCfeed right now is an interface/protocol. It can be whatever one configures it to be.
[12:17] it's in flux; the minimal definition is queuing functionality similar to kafka's
[12:17] There is no RCFeed composer or node package.
[12:17] a medium definition is that there's a producer API as well
[12:17] the full-fledged definition might eventually have a consumer API, too
[12:18] the implementation is at the 'medium' stage
[12:18] We could have a MediaWiki extension that implements the basic system as an SQL table. And exposes its HTTP through the MediaWiki API :/
[12:19] Krinkle: yeah, especially for the producer side that should be fine
[12:19] websockets might be a bit tricky in that scheme, but.. maybe with hack?
[12:19] But how RB is going to listen to that is another matter..
[12:20] Right now Parsoid and RB both don't need a complicated installation right? Parsoid is a service that works out of the box with npm-install and a config file. RB can fallback to sqlite
[12:20] yeah
[12:20] if we do eventbus as a node service, then it needs a database.
[12:20] and we should be able to bundle both into a single service
[12:21] using service-runner to offer both services in a single node worker on low-mem installs
[12:21] Right
[12:21] I won't be able to install this on my 2 wikis that I run in shared hosting though.
[12:21] yeah, the other option would be to leverage RB's storage modules
[12:21] I was able to enable memcached and APC and upgrade to PHP 5.6 from their config panel. But cgi through apache only.
[12:22] the packaging discussion is a fun one
[12:23] hopefully, we'll eventually have either 'docker run mediawiki', or 'apt-get install mediawiki'
[12:24] I can upload any PHP and python. but it's cgi through apache. And no node.
[12:24] Now I could easily add 1 or 2 dollars per month and then I have it elsewhere
[12:24] There are old sites I don't mind upgrading. It's just an example :)
[12:24] you can get a full VM for $2 a month ;)
[12:24] http://serverbear.com/
[12:24] But that means I'm now maintaining a server
[12:24] I don't want that
[12:24] I've spent less than 4 hours in total on these sites in the past 3 years.
[12:24] yeah, hence docker or the like
[12:26] this conversation isn't easy to following coming into late, can someone summarize?
[12:26] s/following/follow/
[12:26] urandom: an issue is that we need some private events that aren't in rcfeed right now
[12:27] the ones we need right now are compatible with RC, they're just omitted right now.
[12:27] So for now we don't need an extra layer of topics yet and where to store them.
[12:27] so we were thinking about ways to filter in a central place
[12:28] * ori still does not understand the crux of the disagreement
[12:28] But we do need to figure out the stock install strategy for eventbus.
[12:28] There is no spoon, ori.
[12:28] is there a spork? that would do, in a pinch.
[12:28] :)
[12:28] a Titanium Spork.
[12:28] Krinkle: what are you objecting to?
[12:28] Nothing in fact.
[12:28] by Light My Fire
[12:29] what is gwicke objecting to?
[12:29] gwicke: So let's go concrete.
[12:29] or are the two of you in ferocious agreement with one another?
[12:29] Let's try something and see what we think.
[12:29] ori: we were just brainstorming
[12:29] I got to run, sorry
[12:29] a node service for eventbus, bundleable with parsoid/rb via service-runner for small installs
[12:29] well, don't I feel like an asshole now! :P
[12:29] using the same sqlite db?
[12:29] Krinkle: why node?
[12:29] configurable with mysql of course
[12:29] * urandom 's head implodes
[12:30] there was the decision to extend EventLogging IIRC, and ottomata is working on packaging
[12:30] ori: node or php, for pubsub, you probably won't want php.
[12:30] ori: This is not for prod.
[12:30] Krinkle: there's a history there, let's not regress
[12:30] But we are not going to advocate that the minimal working install of MediaWiki + VisualEditor includes Kafka and EventLogging
[12:31] why not?
[12:31] It needs Parsoid + RB, and right now that works because RB has a MW extension that will inform RB about any actions it needs to hear about.
[12:31] That is something we want to replace.
[12:31] next year in jerusalem
[12:32] MW is going to get a generic interface (designed after Kafka) that has a topic and a message, basically. And how you publish it, and with what, is configurable, null, by default. We'll use Kafka.
[12:32] anyways
[12:32] or EventLogging
[12:32] either way
[12:32] But for plain installs we're thinking what is the right approach.
[12:32] urandom is new to mediawiki-core development, so let's ease him into our wonderful world of "perhaps you thought you had consensus, but i'm here now to challenge all the points of agreement you thought you had finally established"
[12:33] \o/
[12:33] They already have a node server running with parsoid and restbase (independent, but running alongside in the same service-runner)
[12:33] and in general make room for the people doing the development work to actually do the thinking
[12:33] OK. I didn't know that!
[12:33] Krinkle: this was discussed, in RFC meetings, and in Phab threads long enough to make Samuel Richardson feel inadequate
[12:35] OK. Let's start again with the commit we have in Gerrit.
[12:36] that, is based on the discussions in the ticket, and some here with legoktm
[12:36] urandom: We both miss some context. I for one, am missing the context of EventLogging and Kafka somewhat. I'm aware with that being the direction and am absolutely fine with that. I'm here to learn.
[12:36] urandom: I'd like to understand what brought you to the state of this commit. E.g.is the $event format being modelled after something pre-existing?
[12:37] for some value of pre-existing, yes
[12:37] the schemas are under discussion at wikimedia/restevent#5
[12:37] there is a service that ottomata has been working on, and another at https://github.com/wikimedia/
[12:38] and yeah, the events used with those APIs can be found here: https://github.com/d00rman/restevent/tree/basic-events/schemas
[12:39] Krinkle: so the assumptions that led to that gerrit are that we'd hook into MW somehow, to do an HTTP post to a service, using events formatted according to those schemas
[12:39] and there seemed to be some consensus that RCFeed was a good starting point
[12:39] OK. I don't want to re-open any made decisions, but I think there is some leeway within implementation that may make it seem different, but maintains the same semantics. For example, we don't need to filter down the stream to just the three hand picked sub events. That could introduce an odd subset into the mix that is harder to re-use. If we need a narrow
[12:39] subset for RestBase, I think we can either make the events configurable, or we can have a Kafka consumer within Wikimedia's set up that will listen to recentchanges from EventBus and feed the subset back into another Kafka topic that RESTBase will consume.
[12:39] urandom: Yeah, that sounds good.
[12:40] what do you mean by making the events configurable?
[12:41] Like $wgRCFeeds['eventbus'] => array( ... rcType => array( 'log' ) )
[12:41] it seems that we could evolve how events get into RCFeed slightly to support some amount of filtering / mapping
[12:41] or something like that.
[12:41] But I don't think we need to do that per se.
[12:41] Krinkle: i see
[12:42] yeah, i'd though about adding a configurable filter of some kind
[12:42] I mean, it depends. We want to use this system for more things, so might as well just expose recentchanges as-is in a format we know and understand.
[12:42] otherwise the feed should be called restbase-purges
[12:42] we need a way to add new events without them showing up in all feeds by default
[12:43] & some way to audit what goes where
[12:43] define events. You mean things that are not recentchanges?
[12:43] gwicke: I guess you don't want the restbase purger to receive all of rcfeed?
[12:43] so, pushing everything to a kafka topic (or two, if i understood, private and public), and then reprocessing them into eventbus topics was one option?
[12:44] recentchanges is a collection of events that are all deemed useful & not too sensitive for a public recent changes feed
[12:44] I think 'restricted' would be a JSON property of the blob pushed to mediawiki_recentchanges topic. But it could also be a separate feed.
[12:44] that's how I see it, at least
[12:44] it includes several different events
[12:44] Yeah
[12:45] Krinkle: that's interesting
[12:45] edits, new pages, log events, categories.
[12:45] those are the main RC types
[12:45] and log events has many subtypes. Many of which are provided by MediaWiki extensions and plug ins.
[12:45] it seems easy, we already have code in eventlogging for this, i think
[12:46] categories is actually not exposed in rcfeed. So that gives us a nice compat list: new pages, edits and log events.
[12:46] compact*
[12:46] we'd want to add suppression events
[12:47] They have a generic schema to them defined in MachineReadableRCFeedFormatter that doesn't vary on the subevent. So we can make a schema for it that is generic. E.g. not specific to delete events.
[12:48] urandom: Yeah, so that is the only tricky part. There is a particular type of event, suppression. Which is important to RESTBase, and is not currently provided by the RCFeed system. So as-is it wouldn't solve the first use case.
[12:48] is suppression not something that could be added to recentchanges?
[12:48] even a binary 'recentchanges' vs. 'suppression' topic distinction would already be useful to us
[12:49] but, there are already types that we could elevate to the topic level
[12:49] I'm checking now whether suppression itself also emits a log event.
[12:49] I wonder if one can suppress a suppression event?
[12:49] I think so
[12:49] but then that one also emits a suppression event
[12:50] which should be visible internally only
[12:50] so that RB can suppress content
[12:50] * Foo deleted page Sandbox.
[12:50] suppression is for deletes?
[12:50] urandom: Suppression is for any log entry.
[12:50] Account creation, page move, page delete, user block, anything.
[12:51] See the dropdown menu at https://en.wikipedia.org/wiki/Special:Log
[12:51] yeah, there's two kinds of suppression ;)
[12:51] one for log entries, and one for revisions
[12:51] Authorised users have an additional UI component on that page that allows them to suppress an entry.
[12:51] That will flip a flag in the logging table, causing the UI rendering of that list item to be greyed out.
[12:52] what is that used for?
[12:52] And it then adds a new log entry above it that says a suppress took place.
[12:52] to make a log/entry secret?
[12:52] Foo changed visibility of a delete log event.
[12:52] :/
[12:52] Here's an example.
[12:53] I just deleted my local wiki's Main page
[12:53] 19:29, 13 November 2015 Root (Talk | contribs | block) deleted page Main Page
[12:53] It says that on my Special:Log
[12:53] Now, I hide that event. Which changes the UI rendering of that log entry to this:
[12:53] I agree that this could be an attribute
[12:53] 19:29, 13 November 2015 (username removed) (log details removed)
[12:53] redacted
[12:54] 19:35, 13 November 2015 Root changed visibility of a delete event on Special:Log
[12:54] but, that doesn't necessarily address the revision suppression use case
[12:54] Yeah, I don't think Restbase is interested in log/suppress
[12:54] but you want rev delete
[12:54] which is similar
[12:54] it's actually an event topic
[12:55] rather than a restriction on which events (or parts of events) are visible to which audience
[12:55] The attribute doesn't work indeed. Because no event would ever be restricted from that point of view.
[12:56] I mean, the delete event wouldn't be restricted when the consumer originally got it.
[12:56] ok, so tl;dr, these suppressions aren't something that can be handled by recentchanges?
[12:56] It can be, but currently is omitted from it
[12:56] It is added to Special:Log
[12:56] k
[12:56] and almost all of Special:Log is also in Special:recentChanges (limited to 30 days)
[12:57] but suppression is currently omitted from recent changes
[12:57] it depends on whether "recentchanges" is what you see in public recent changes, or if it's "all events"
[12:57] The logging system has a 'restricted' attribute which controls whether a user sees it when they view Special:Log
[12:57] the recentchanges system does not currently have this attribute.
[12:57] But we could add it.
[12:58] in the short term, it seems safer to not feed sensitive information into RCFeed
[12:59] Well, it would be hidden by default.
[12:59] one way to guarantee that could be to dispatch those events at a layer below
[13:00] the simplest version of that could perhaps be to start with a single internal topic called 'recentchanges'
[13:01] we can then add a second one called 'suppression', and subscribe the eventbus producer (but not others) to both
[13:03] Yeah
[13:04] later, we could consider un-bundling 'recentchanges' into different events
[13:04] gwicke: when you say 'topic', do you mean to have MW produce to kafka directly, and then subscribe and forward to eventbus?
[13:04] So I checked where revision delete comes in. Both revision delete and log suppression are part of log type 'delete'. With log actions delete/revision,and delete/logging respectively.
[13:05] or are you referring to 'topic' conceptually here
[13:05] urandom: in this context, I mean a string property we can match on
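
A small sketch of "topic" as a string property that subscribers match on, with the two internal topics named above. It is an illustration in Python, not MediaWiki or EventBus code. `TopicRouter` and the two handlers are hypothetical names.

```python
from collections import defaultdict


class TopicRouter:
    """Dispatch events to handlers by matching on the event's 'topic' string."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def emit(self, event):
        # Events on a topic nobody subscribed to are simply dropped.
        for handler in self._subscribers.get(event["topic"], []):
            handler(event)


rcstream_received = []   # stands in for the public feeds
eventbus_received = []   # stands in for the internal EventBus producer

router = TopicRouter()
router.subscribe("recentchanges", rcstream_received.append)
router.subscribe("recentchanges", eventbus_received.append)
router.subscribe("suppression", eventbus_received.append)  # internal only

router.emit({"topic": "recentchanges", "action": "edit", "title": "Sandbox"})
router.emit({"topic": "suppression", "action": "delete/revision",
             "title": "Sandbox"})
```

Adding a new topic later does not change what existing feeds receive, since each feed only gets the topics it explicitly subscribed to.
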
[13:05] Krinkle: so rc does supply the suppression we need?
[13:10] OK. I'm gonna zoom out for a minute.
[13:11] there's definitely a "logentry-delete-revision" message
[13:12] Let's consider a (simplified) version of MediaWiki: revision table, logging table, recentchanges table. Edits are saved in revision table. A summary of this is also saved in the recentchanges table with added info we only keep for 30 days. Non-edit actions (such as deleting pages, account creation and renaming pages) are logged in the logging table. The
[13:12] logging table is kept indefinitely and viewable via Special:Log.
[13:12] Log actions, like edits, also result in the creation of a recent changes entry.
[13:12] All recent changes entries are also published through any RCfeeds the site has configured.
[13:13] the logging table has an attribute 'restricted'. Entries with this set are omitted when a user queries rows from the logging table for Special:Log.
[13:13] When log entries with 'restricted' are created, they also do not go to recentchanges.
[13:14] When an admin deletes a page in MediaWiki, Restbase needs to know about this so it can delete it there too.
[13:14] it's a more specialized version of the topic thing
[13:14] it would basically hard-code some mappings
[13:15] with topics, we could have events that aren't logged internally at all
[13:15] Page Delete events are normally public and as such in recent changes.
[13:16]


gwicke commented Nov 13, 2015

[15:33] gwicke: hey
[15:38] ebernhardson: hey, I was just looking into php kafka producer options, and ottomata pointed me to https://github.com/wikimedia/mediawiki-vendor/tree/master/nmred/kafka-php
[15:39] he mentioned that it skips the zookeeper stuff, so I was wondering what the implications of that are
[15:39] gwicke: yes, that's what we are currently using in production. i stripped out the zookeeper part because it was only using zookeeper to get a list of brokers that were active for a partition
[15:40] gwicke: kafka added an api recently to get that info directly from kafka and skip zookeeper
[15:40] gwicke: so, in short there should be no downside, it's just getting the data direct from kafka instead of from zookeeper. I didn't look into the kafka side of things but i'm imagining kafka probably queries zookeeper for you
[15:40] okay, so it'll still handle master fail-over etc?
[15:40] gwicke: yes
[15:41] I see
[15:41] and this is faster than talking to ZK?
[15:41] gwicke: not sure about faster, but there were no good php level libraries for talking to zookeeper, we would have had to port a C level php module to hhvm
[15:41] weiboad/kafka-php#17
[15:42] it seemed that the info is queried per request
[15:43] gwicke: at least in the code i wrote it is cached inside the php process, but not across processes: https://github.com/nmred/kafka-php/blob/master/src/Kafka/MetaDataFromKafka.php#L120
[15:43] so this is per PHP web request?
[15:43] or across requests?
[15:43] gwicke: yes
[15:44] gwicke: per php request
[15:44] kk
[15:44] gwicke: in prod this is done after closing the request to the user (register_psp_function) so there is no user visible latency
[15:44] we are targeting fairly low volume stuff in any case (edit events), so it's probably fine
[15:44] well, it happens that way indirectly by using the 'buffer' flag in monolog on the channel, which pushes into DeferredUpdates, which uses register_psp_function
[15:45] so for edit events, you would want to do similar with DeferredUpdates most likely
[15:45] yeah, accumulate & then flush in a deferred update
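
A sketch of the "accumulate, then flush in a deferred update" pattern, in Python for illustration. The `DeferredUpdates` class here is only a stand-in with the same general idea as MediaWiki's (callbacks run after the response has been sent). `BufferedProducer` and `send_batch` are hypothetical and do not correspond to the kafka-php API.

```python
class DeferredUpdates:
    """Stand-in: collects callbacks and runs them after the response is sent."""

    def __init__(self):
        self._callbacks = []

    def add(self, callback):
        self._callbacks.append(callback)

    def run_all(self):
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()


class BufferedProducer:
    """Buffers events during a request and sends them in one batch later."""

    def __init__(self, send_batch, deferred):
        self._send_batch = send_batch
        self._deferred = deferred
        self._buffer = []

    def produce(self, topic, message):
        if not self._buffer:
            # First event of the request: schedule a single flush.
            self._deferred.add(self.flush)
        self._buffer.append((topic, message))

    def flush(self):
        batch, self._buffer = self._buffer, []
        if batch:
            self._send_batch(batch)


sent_batches = []
deferred = DeferredUpdates()
producer = BufferedProducer(sent_batches.append, deferred)

producer.produce("recentchanges", {"action": "edit", "title": "Sandbox"})
producer.produce("recentchanges", {"action": "delete", "title": "Sandbox"})
batches_before_flush = len(sent_batches)  # nothing sent during the request

deferred.run_all()  # simulates the post-send phase
```

The user-facing request never waits on the broker; the cost is that events are lost if the process dies between the response and the flush, which fits the "fine if you don't need 100% reliability" trade-off mentioned earlier in the log.
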
[15:46] thanks, sounds like we have one more option for getting those events into kafka
[15:46] excellent, np
[15:46] i'd also like to use these events for making a more robust update process for elasticsearch, but that's probably a bit far off in the horizon :)
[15:46] we are trying hard to get this out before Christmas
[15:47] wish us luck ;)
[15:47] schemas are under discussion at wikimedia/restevent#5
