larsbergstrom/gist:9539dae389987037ac17

## gistfile1.txt

# agenda

- painting tiles in parallel
- per-tile display lists
- avoid painting
- omtc

# per-tile display lists

- matt: currently, have to do the full viewport, which is a big overhead but has to be done due to layerization-related decisions. Idea is to split the screen into a set of tiles with layers per tile. If you get a repaint, you repaint the tile.
- jeff: do layers not cross tiles?
- matt: have to fix-up
- pcwalton: Servo is more like webkit (i.e., broken). On each flow/fragment (frame) we decide whether it wants a layer or not. Display lists are per stacking context.  It’s a tree of stacking contexts with flags on whether you have a layer or not. If you don’t, you are part of your ancestor’s layer. Display lists are flat inside of a stacking context. We would change this so each context would have tiles and the tiles would have display lists. That avoids the problems where tiles straddle layers because they don’t - display lists are per-stacking-context. But, it’s broken because there are layers that are not stacking contexts (because of the spec). So we’d like to keep this approach but instead of stacking contexts having a flag, they have layers that have display items, adding a tier. MLPSC - multiple layers per stacking context.
- jeff: sounds sane.
- pcwalton: We build these display lists, shuffle them into the right order and then in the painting thread, we do display list optimization to remove things outside the display port. We should also cull invisible stuff, but we don’t yet. Also cull outside the tile (since Skia is so bad about that). Finally, do DLBI (display-list based invalidation). We diff display lists.
- jeff: size of tiles
- gw: 512x512
- pcwalton: Also, display port is about 8x the size of the view port. The display port is just the area we build display lists for. We don’t keep tiles for them, though - just the display lists.
- nical: that would be nice!
- pcwalton: Works because our display lists have no connection to the flows/frame tree. They have an opaque ID of the DOM node for hit testing / hover, but that’s it. Can be serialized, processed in parallel, etc.
- nical: Have PaintWorker?
- pcwalton: Yes. We just round-robin distribute the tiles to threads. Optimization is done during the painting of the tile (to avoid Skia issues) on their own copy.
- nical: Worker does all commands for a tile?
- gw: Yes.
- jeff: Pinch to zoom?
- pcwalton: Yes. And we’re only async pan/zoom. Paint when you release.
- jeff: Layout repaint or just use the display list?
- pcwalton: Just the display list, unless you leave the display port. That’s why our display port is so large. Also, since layout is separate from JS, we can ask for more display lists from layout even if script is running.
- jeff: Chrome is starting to repaint while you pinch.
- pcwalton: Easy to do for us
- matt: It looks nice.
- pcwalton: We can also do zoom text-only for every frame, since layout is so fast, we can do that and it looks even more amazing. I want to do it on mobile.
- nical: Other things on multithreaded painting? We did try not optimizing per-tile, but Skia killed us due to their really late culling. We’re looking into it and wanted to know if there was any insight.
- pcwalton: Definitely optimization per-tile.
- A: Why 512x512?
- gw: Arbitrary.
- A: Just Skia?
- pcwalton: Yes. In theory other things, but not really.
- nical: How’s Skia working?
- larsberg: Have some issues with GPU painting, but CPU is fine. Non-parallel on GPU.
- nical: box shadows?
- pcwalton: Intermediate surfaces that are larger, blur it, then clip it. Per-tile.
- nical: We’re looking at either that or synchronously cover it for all tiles.
- nical: non-tiled layers like canvas? or small animated things?
- gw: Only makes the tile as big as it needs to be. 600x600 = 512x512 + 512x88 tiles.
- nical: recycle per-layer?
- pcwalton: Yes, but working on doing it globally.
- jeff: What about specially-sized ones?
- pcwalton: If you’re scrolling down, you might reuse, so we do hold them and bin based on size. Most of the tiling code is in our rust-layers project. The backend is opengl-only, but is designed for other ones.
- nical: Usually not that bad.
- pcwalton: Don’t know if we’ll do D3D or ANGLE.
- jeff: We don’t use ANGLE but don’t have any good reasons. The code required is really small, though. And ANGLE is not simple code; so a lot of complexity.
- pcwalton: Direct3D does not work with Angle surfaces easily. I’m sure we can, but hard.
- nical: We do that with webgl. But stability and graphics is hard with sharing surfaces…
- larsberg: Lots of issues with parallelism and stability at the beginning.
- jeff: How do you do surface sharing?
- pcwalton: GL textures with upload from the paint threads to remove compositor jank.
- jeff: How do you share?
- pcwalton: xpixmaps on Linux, eglimage on Android, iosurface on Mac. Iosurfaces work well.
- nical: e10s?
- pcwalton: Trying to restrict GPU access from the content process. We proxy all canvas, webgl, and display lists to the chrome process, which does all the painting. Content never does vector graphics or GPU.
- nical: Neat, but it would be great if the compositor could crash without taking down Servo, and Servo could keep going (e.g., with the CPU backend).
- jeff: We recover…
- nical: Sometimes we don’t. Even if it’s just recover w/o switching backends. If we lose the device, we are using a given backend for the life of the process. Also, only crashes on release and not nightly.
- pcwalton: We could have off main thread compositing; could do off-process compositing.
- nical: Chrome has this.
- jeff: IE doesn’t and has more configs.
- pcwalton: I didn’t know why Chromium was three-process, but I suspect it’s due to GPU crashes. Maybe we should do that.
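
The tiling steps discussed above (512x512 tiles, edge tiles only as big as they need to be, round-robin distribution to paint threads, and culling each display list to its tile) can be sketched roughly as follows. This is a minimal illustration, not Servo's or rust-layers' actual code: `Rect`, `split_into_tiles`, `assign_round_robin`, and `cull_to_tile` are hypothetical names, and display items are reduced to their bounding rectangles.

```rust
// Hypothetical sketch: names and types here are illustrative assumptions,
// not the real Servo or rust-layers API.

/// Axis-aligned rectangle in layer pixels.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x: u32,
    pub y: u32,
    pub w: u32,
    pub h: u32,
}

impl Rect {
    /// True when the two rectangles share at least one pixel.
    pub fn intersects(&self, other: &Rect) -> bool {
        self.x < other.x + other.w
            && other.x < self.x + self.w
            && self.y < other.y + other.h
            && other.y < self.y + self.h
    }
}

/// Tile edge length quoted in the discussion (described there as arbitrary).
pub const TILE_SIZE: u32 = 512;

/// Split a layer into tiles, clamping edge tiles so none extends past the layer.
pub fn split_into_tiles(layer_w: u32, layer_h: u32, tile: u32) -> Vec<Rect> {
    assert!(tile > 0, "tile size must be non-zero");
    let mut tiles = Vec::new();
    let mut y = 0;
    while y < layer_h {
        let h = tile.min(layer_h - y);
        let mut x = 0;
        while x < layer_w {
            let w = tile.min(layer_w - x);
            tiles.push(Rect { x, y, w, h });
            x += w;
        }
        y += h;
    }
    tiles
}

/// Round-robin assignment: tile i goes to worker i % workers.
pub fn assign_round_robin(tiles: &[Rect], workers: usize) -> Vec<Vec<Rect>> {
    assert!(workers > 0, "need at least one paint worker");
    let mut queues = vec![Vec::new(); workers];
    for (i, t) in tiles.iter().enumerate() {
        queues[i % workers].push(*t);
    }
    queues
}

/// Per-tile culling: keep only the display items whose bounds touch the tile.
pub fn cull_to_tile(item_bounds: &[Rect], tile: &Rect) -> Vec<Rect> {
    item_bounds
        .iter()
        .copied()
        .filter(|b| b.intersects(tile))
        .collect()
}
```

In this sketch each worker would call `cull_to_tile` on its own copy of the display list before painting, which mirrors the point that optimization happens during the painting of the tile. For a 600x600 layer, `split_into_tiles` produces one full 512x512 tile plus narrower edge tiles that are 88 pixels wide or tall.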

# blacklists

- matt: Both too late (hundreds of thousands of crashes later) and not always deterministic. So now we’re adding runtime tests.
- pcwalton: Do crashes go down with later versions of windows?
- nical: Possibly yes.

# gpu vs. cpu painting

- larsberg: for us, it was crashing with GPU painting - same for you?
- nical: mainly around surface sharing. Updating constant buffers causes the end of the pipeline to fail, but if you share the surfaces slightly differently, it worked better.
- matt: texture sharing just isn’t used by many people
- pcwalton: Just OS compositor for simple things.
- matt: that’s a huge problem
- pcwalton: Don’t know how we avoid it for OMT painting.
- nical: Upload from the compositor instead.
- pcwalton: Roc said chromium is moving towards more stuff in the compositor
- matt: there’s just one GPU, so why not?
- jeff: They have a compositor per tab

# gpu painting

- pcwalton: Wondering if GPU painting is even the right thing. Using GPU vector API at all? If you worked on display lists, you could do a better job. Instead of arbitrary vectors, maybe should have a renderer that’s just doing nsDisplayList and kicked off to azure/skia for things it didn’t do, could avoid the pathological cases.
- nical: that’s why I was excited about a scene graph based renderer. It gives a bunch of objects instead of drawing commands. Better than the moz2d glyph drawing to instead just pass display list
- jeff: Here, talking about the level of operation higher. e.g., CSS-like instead of a bunch of vectors. So, draw a box with a kind of border instead of a bunch of vectors or a scene graph retained model.
- pcwalton: Yes. I think the case for high-level CSS border commands is that asking Skia to draw a dashed path is crazy, because we could just write a shader. Then, for SVG and canvas, fall back, but just for them. The common stuff is super simple.
- jeff: D2D takes in the commands and merges stuff.
- pcwalton: But have to reconstruct it from low-level commands.
- jeff: Can do cross-element merging. Like if you construct commands per CSS box, you can’t do text in one draw call.
- gw: Can batch at the tile level, so if you have borders in each tile, can create a single batch.
- pcwalton: Batching renderer can also know about CSS things.
- gw: I think it would benefit more from that. The shader combo and uniform buffer combos are equivalent for CSS borders or box shadows.
- pcwalton: Just don’t want to do CPU tessellation. Even for rounded.
- jeff: Could do it in a shader, but sure if you need to. Disadvantage in a shader is you might have to switch. D2D has drawGeometry.
- pcwalton: I guess I’m basically just wondering why SkiaGL isn’t better.
- nical: Huge cache of things because it doesn’t know what’s used in each frame, because the API retains more info.
- jeff: I disagree.
- pcwalton: yeah, I’m wondering why ganesh is only 50% better, not 5x
- nical: It’s not a GPU drawing API!
- pcwalton: If we really want retained scene graphs, Servo almost has one ready due to our display lists. Without much work, we could hang resources off the display items. Then, when DLBI determines items are the same, retain GPU objects. No reason we couldn’t do this.
- nical: I would do this! But, I’m a little too excited about scene graphs. And how close your display lists are to CSS.
- pcwalton: Yeah, and there aren’t many of them. 8 now; need SVG, though. Haven’t figured out what we’re doing there. Would like to just do it via Canvas.
- nical: Lots of corner cases.
- pcwalton: Not sure the intersection of supported SVG is that big.
- jeff: Really? IE’s is bad?
- pcwalton: I’m talking about like iframe inside SVG.
- jeff: All the text layout I think they do, which is not great in canvas. No good text API.
- pcwalton: Kinda wondering how much work it is for bad SVG.
- jeff: It’s really not that bad.
- nical: Just don’t do a GPU renderer yourself on it. Self-intersecting shapes are bad.
- jeff: I don’t think you gain anything by doing it in canvas.
- pcwalton: Faster to stand it up.
- jeff: You will have to bite the bullet anyway, because the hack is not shippable.
- nical: More painting on the compositor is for animations with the border radius…
- jeff: They cut that, as of two weeks ago.
- pcwalton: Not an issue in servo; animations requiring layout can happen off main thread.
- nical: Still bandwidth, though.
- matt: Instead of rasterizing into an intermediate buffer, just rasterize straight to the window, avoiding one copy/allocation.
- gw: Calls directly instead of uploading the texture.
- nical: allocating them is brutally slow, which you need if it resizes.
- jeff: If you have a page in servo with a box that resizes, it’ll work just fine. you need to re-render at some point.
- pcwalton: That approach avoids the intermediate texture, but…
- jeff: If you are software rasterizing, you still have to re-render.
- nical: That’s the screen for them. It’s like not having layers for that part of the content.
- pcwalton: We could synchronously pass the gl context to the paint thread. Or merge the paint thread to the compositor. Saves one intermediate render target.
- gw: Also save a texture upload if you’re rendering on the GPU.
- pcwalton: Only for CPU painting, not GPU painting.
- nical: Can keep it in mind, but not urgent.
- pcwalton: I feel like render to texture is really optimized.
- jeff: There’s a power cost.
- gw: Also really bad on mobile on tiled architectures.
- jeff: People have done a lot of engineering to save these copies. Chrome has an extra one.
- nical: They’ve been working on removing this for ages.
- pcwalton: If we have to pass the context to the paint thread, we could do it.
- gw: Remove it; have the compositor run that.
- pcwalton: I still don’t want to do it because it’s the total inverse of what you want to do for CPU painting.
- nical: Yeah, I’d just try it for GPU painting, if the experiments show it blows things away.
- pcwalton: I really want scrolling to be uninterruptable by painting in the background so it always moves quickly. Great with CPU painting. But with GPU painting, it janks your scrolling.
- nical: The Google guys are working on it. They have Khronos stuff to assign priorities and to be able to change them dynamically.
- jeff: Need preemption. Microsoft has been asking for that for a while. The newest AMD hardware does have it.
- pcwalton: Need separate paint threads then, though. That’s another reason I don’t want to rush to everything on the compositor.
- gw: Does the preemption have anything to do with the CPU thread? Surely, just changing which command buffer is executing. The CPU calls should all be async, unless you’re doing synchronous readback.
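
The idea raised in this section of hanging GPU resources off display items, and retaining them when DLBI finds an item unchanged, could look something like the sketch below. Everything here is an assumption for illustration: `DisplayItem`, `ItemKind`, `GpuHandle`, and `RetainedCache` are hypothetical types, the cache is keyed by a single opaque node ID per item (a simplification), and "creating a GPU object" is reduced to handing out a fresh integer.

```rust
use std::collections::HashMap;

// Hypothetical sketch: these types are illustrative assumptions and do not
// correspond to Servo's real display list or to any GPU API.

/// A small set of CSS-level item kinds, standing in for real display items.
#[derive(Clone, Debug, PartialEq)]
pub enum ItemKind {
    SolidColor(u32),
    Border { width: u32, dashed: bool },
    BoxShadow { blur: u32 },
}

/// A display item carrying only an opaque node ID, a kind, and its bounds.
#[derive(Clone, Debug, PartialEq)]
pub struct DisplayItem {
    pub node_id: u64,
    pub kind: ItemKind,
    pub bounds: (i32, i32, i32, i32), // x, y, width, height
}

/// Stand-in for a GPU object such as a vertex buffer or texture.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct GpuHandle(pub u32);

#[derive(Default)]
pub struct RetainedCache {
    entries: HashMap<u64, (DisplayItem, GpuHandle)>,
    next_handle: u32,
    /// How many times a GPU object had to be (re)built.
    pub uploads: u32,
}

impl RetainedCache {
    /// Diff the new list against what was retained from the previous frame.
    /// Unchanged items keep their handle, new or changed items get a fresh
    /// one, and items missing from the new list are dropped.
    pub fn update(&mut self, new_list: &[DisplayItem]) -> Vec<GpuHandle> {
        let mut next = HashMap::new();
        let mut handles = Vec::with_capacity(new_list.len());
        for item in new_list {
            let handle = match self.entries.get(&item.node_id) {
                Some((old, h)) if old == item => *h,
                _ => {
                    self.uploads += 1;
                    self.next_handle += 1;
                    GpuHandle(self.next_handle)
                }
            };
            next.insert(item.node_id, (item.clone(), handle));
            handles.push(handle);
        }
        self.entries = next;
        handles
    }
}
```

Calling `update` once per frame with the freshly built display list would return one handle per item; only items that differ from the previous frame increment `uploads`, which is the saving the retained approach is after.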

# layerization

- nical: Just remember to keep layerization super flexible. It hit us really hard on FFOS and Gaia has to be able to pass it down.

# specialized commands

- gw: Would love to show that you can do a specialized GPU shader for the web. The other advantage is that you're working like games and not hitting the buggiest paths.
- lee: Still buggy.
- nical: Games still don't use offscreen surfaces. I don't like Skia.
- lee: advantage is just being off our ~2006 Cairo.
- nical: Spend like 60% of your time preparing and touch pixels very little of the time. Redo lots of work. If we just used Cairo or Skia on many tiles in parallel, we just have to do things twice.
- gw: When we shape text, we could start building the glyphs.
- nical: That's what the Enlightenment people do; prepopulating the cache.
- pcwalton: We have it early anyway since we need it for intrinsic widths.
- nical: Bas says it's less important on the web. You will build it in the first tab...
- pcwalton: Different font sizes, though. Different font faces over and over.
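
The suggestion above of starting to build glyphs at shaping time, so the cache is already warm when painting begins, might be sketched like this. It is only an illustration under stated assumptions: `GlyphKey`, `GlyphCache`, and the placeholder `rasterize` function are hypothetical, and a real implementation would call into a font rasterizer instead of allocating a blank bitmap. The key includes both font and size, reflecting the point that different sizes and faces each need their own entries.

```rust
use std::collections::HashMap;

// Hypothetical sketch: the key layout and the placeholder rasterizer are
// assumptions for illustration, not an existing Servo or Gecko API.

/// Cache key: the same glyph at a different size or face is a separate entry.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct GlyphKey {
    pub font_id: u32,
    pub size_px: u32,
    pub glyph_id: u32,
}

pub struct GlyphCache {
    bitmaps: HashMap<GlyphKey, Vec<u8>>,
    /// Number of glyphs actually rasterized (cache misses).
    pub rasterized: usize,
}

impl GlyphCache {
    pub fn new() -> Self {
        GlyphCache {
            bitmaps: HashMap::new(),
            rasterized: 0,
        }
    }

    /// Placeholder rasterizer: returns a blank size_px x size_px bitmap.
    fn rasterize(key: &GlyphKey) -> Vec<u8> {
        vec![0u8; (key.size_px * key.size_px) as usize]
    }

    /// Called at shaping time: warm the cache for every glyph in the run.
    pub fn prepopulate(&mut self, font_id: u32, size_px: u32, glyph_ids: &[u32]) {
        for &glyph_id in glyph_ids {
            let key = GlyphKey { font_id, size_px, glyph_id };
            if !self.bitmaps.contains_key(&key) {
                self.bitmaps.insert(key, Self::rasterize(&key));
                self.rasterized += 1;
            }
        }
    }

    /// Called at paint time: a hit means painting does no rasterization work.
    pub fn get(&self, key: &GlyphKey) -> Option<&[u8]> {
        self.bitmaps.get(key).map(|v| v.as_slice())
    }
}
```

With this shape, layout would call `prepopulate` for each shaped text run and the paint threads would only ever call `get`.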

# stability & perf

- nical: This is why chrome has a third process - handle the crashes and recover.
- mbest: Detecting and recovering, switching back to software rendering will help. What's the downside?
- nical: WebGL penalty.
- pcwalton: We pay the cost already in Servo.
- nical: Painting & compositing are where the problems are happening for us. We keep having new problems. Using the GPU from multiple threads is something few other things do in the wild.
- jgilbert: Other people do it.
- nical: Window manager does, but it's really simple.
- jgilbert: Separate processes if it crashes is fine. But the most common failure is not that. Even on good machines, I get problems with running out of memory and things just going black.
- nical: If you are running out of memory, the most sensible thing to do is crash a content process. Was thinking of trying to recover, etc. but it's way too hard & out of control. Once half your allocations are going to fail, Gecko is in trouble. We could try to recover/free stuff.
- jgilbert: The experience is miserable, though. The question is - how much do we win by doing a huge amount of work on the GPU, especially for content?
- nical: Right now, we’re doing it wrong.
- pcwalton: Google quotes 50% perf improvement for ganesh.
- gw: Biggest improvements at big resolutions.
- nical: After we finish OMTC, we can start looking at some of this.
- pcwalton: Interested in writing some Rust?
- nical: In two years :-)
- larsberg: Alternatively, what can we do to help?
- gw: We could prototype all of this custom rendering stuff in Servo and see if it works to measure the benefit over SkiaGL before Gecko does it.
- pcwalton: It’s in line with what we want to do.
- nical: It sounds great; falling back to SVG is great.
- jgilbert: We have some problems with our reftests; they use Canvas and are fragile w.r.t changes.

# GL APIs

- jgilbert: Extensions? I’m on Khronos.
- pcwalton: Priorities / niceness would help for all our parallel threads doing GL calls. Would help avoiding accidentally janking the compositor with all our GPU painting/operations. Also GL swap buffers are painful; if we had async that would help. Right now glSwapBuffers blocks; would be nice if it didn’t so we could handle events. We can work around it by going to a different thread.
- gw: Also, DX12 and Vulkan with multithreaded command buffers.
- nical: Streaming stuff to the GPU without the driver doing something unexpected.
- jgilbert: Number of these are better on GLES3 for attaching and blitting things across. Copy texImage will work easier to avoid marshalling/unpacking things. Modern phones just have this support. Android L has GLES3 plus extensions.
- mwu: If you have Gonk, they have really nice low-level things for sending textures, like gralloc.
- pcwalton: eglImage on Android; iosurface on mac, xpixmaps on X.
- nical: Pain with genlock because it kept changing every version of Android. Now we just get fences and handle it all ourselves.
- mwu: Well, it was random in older versions.
- sotaro: The older versions are still useful for testing because they’re more conservative.
- nical: But with the explicit fences you can avoid the genlock issues entirely. Anyway, X11 & Gralloc are not as much. Bugs are bigger.
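
The workaround mentioned above for a blocking `glSwapBuffers` (doing the swap on a different thread so the event loop stays responsive) can be sketched with a channel and a dedicated presenter thread. No real GL is involved here: `blocking_swap` is a hypothetical stand-in that just sleeps, and `Frame`, `spawn_presenter`, and `run_event_loop` are illustrative names. In a real setup the GL context would also have to be current on the presenter thread.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical sketch: blocking_swap simulates a blocking buffer swap;
// it makes no actual GL calls.

/// A finished frame, identified only by a sequence number.
pub struct Frame(pub u32);

/// Stand-in for a swap call that blocks until the display is ready.
fn blocking_swap(_frame: &Frame) {
    thread::sleep(Duration::from_millis(2));
}

/// Spawn a presenter thread. Returns a sender for frames and a join handle
/// that yields the number of frames presented.
pub fn spawn_presenter() -> (mpsc::Sender<Frame>, thread::JoinHandle<u32>) {
    let (tx, rx) = mpsc::channel::<Frame>();
    let handle = thread::spawn(move || {
        let mut presented = 0;
        // The loop ends once every sender has been dropped.
        for frame in rx {
            blocking_swap(&frame);
            presented += 1;
        }
        presented
    });
    (tx, handle)
}

/// Event-loop side: queue frames without ever waiting on the swap itself.
/// Returns (events handled, frames presented).
pub fn run_event_loop(frames: u32) -> (u32, u32) {
    let (tx, presenter) = spawn_presenter();
    let mut events_handled = 0;
    for i in 0..frames {
        tx.send(Frame(i)).expect("presenter thread exited early");
        events_handled += 1; // stand-in for handling an input event
    }
    drop(tx); // close the channel so the presenter can finish
    let presented = presenter.join().expect("presenter thread panicked");
    (events_handled, presented)
}
```

The unbounded channel keeps the sketch short; a real compositor would more likely bound the queue or drop stale frames so the presenter never falls far behind.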


# Servo in Gecko

- mbest: Anything that is ready to go here?
- pcwalton: Talking about maybe pulling CSS animations in.
- mbest: Any big sell? Performance or stability? I’d like to see those more.
- pcwalton: Parallel layout and off main thread layout are the super high-value things, but mainly because they’re hard to do in other browser engines. Plenty of things we can do to use in both engines that we could start working on with Rust in Servo. Replacing image decoders like libjpeg-turbo, which is super valuable for both Servo and Gecko.
- mbest: Does that help?
- pcwalton: Perf and security are the two big issues with that library.
- jack: Could decode it on the GPU.
- lee: But then you have to bring it down from the GPU if we’re CPU rendering…
- mbest: Doing something on the critical path would be best. Something that’s hitting our users and is causing complaints.
- pcwalton: I’m getting that there are a lot of things Gecko wants to do that we also need to do. Would be great if we can do them or get somebody from Gecko doing them in Rust and sharing them.
- mbest: It would be nice if we were moving toward the same target, since it will keep priorities aligned. If you have a list out of this, please send them to me.
- pcwalton: There’s plenty of stuff we’re also pulling from Gecko like opentype sanitization. We use Moz2d and spidermonkey. Shader sanitization is something I want to use from Gecko, too.
- jgilbert: If we move away from ANGLE, though, that’s something we both would want.
- pcwalton: Video codecs we want to use. Audio codecs, etc.
- mbest: Roc has been talking about using Rust in media.
- pcwalton: We’re trying to share, but eventually, we’ll hit a gap where the last bit (layout+script) is hard to close. Was talking with ehsan about doing the same bindings generator as Gecko so we can generate Rust code that hooks into Gecko for new features.
- mbest: Hardest Servo thing is a path that delivers in a good timeframe. I’d like to talk about that over the next quarter.
- larsberg: We have a few places like Necko where they’re in good shape so we’re just building our own stack. I’d love help figuring out how we make good decisions.
- mbest: That kind of stuff is my top priority to work on. Gotta figure out how to reduce our scope.
- pcwalton: I like the OS9->OS10 analogy.
- mbest: Let’s get some early examples to prove the case and come back with a plan. CBeard will want it. I think gfx is a huge place to show prowess of improvement. Where can we have the most impact?
- pcwalton: Fast content-focused rasterizer would have a huge impact. Fits all the sweet spots and reuse.
- nical: Maybe hard to reuse…
- gw: Even if it is, having a working design and numbers is great.
- jack: At the end of the day, it’s shaders and you should reuse it.
- mbest: Need a good plan forward. The incremental reuse and shipping in Gecko plan is key. What do you think we should do, Jeff?
- jgilbert: Display list rendering sounds good to me. Needs a marshalling layer, but honestly I think that we would benefit from having a little more API separation.
- nical: I’m worried that stuff that gives Servo an edge is infra work like taking stuff off the main thread. It will take a long time to catch up on the architectural part.
- pcwalton: Layout and DOM are super-coupled and you just have to replace that part. Don’t know how to bridge it until we’re just really good and nearly done.

# Other big roadblocks?

- mbest: What else is a big fear here?
- pcwalton: Addons and XUL in Gecko. But I recommend using Gecko for Chrome and Servo for content. MS is doing it; so clearly it’s reasonable. Another issue is webcompat and user agent sniffing. Not sure how much we can get away with and how much a new browser will run into. Edge is working on this now, but we’ll have to tackle it, too. Better than FF in the early days, though. There are some random FF things, like non-standardized gonk APIs, devtools integration, etc. Some other tools, like memory profiling, just need to be in Firefox. Would have been much harder before e10s on the desktop.
- larsberg: If we decide we want to push Servo in Gecko all the way to release earlier, we may need to invest in improving a set of Rust-related work.
- mbest: Definitely. If we have a longer-term plan with an arc, it will help define when different Rust bits need to land and help us make a resourcing plan.
- larsberg: Need early wins.
- mbest: Yes, plus very deep vetting and serious planning, involving a broad spectrum of people.
- mbest: Our big issues to address are: 1) how do we show improvement, 2) training people up, 3) when is the big surgery point.
- nical: One issue if we do the two-engine approach is size.
- mbest: Do we have numbers?
- mbrubeck: FF has doubled in size in two years.
- pcwalton: If we share most components well, we’re probably only a few megabytes.

# Next steps

- mbest: What’s next?
- larsberg: Q3 servo planning for a graphics experiment. Work with mbest on planning.
- mbest: And train me up on Rust. Let’s do this!

# FFOS issues

- sotaro: The IPC that has to go through the main thread causes serious blocking on FFOS, especially with video playback. That is my biggest problem.
- pcwalton: We do not require IPC to go through the main thread in Servo.
- mbest: Is there something we can do to fix that?
- sotaro: We’re being encouraged to move to ServiceWorker instead, once that is done. In FF MediaFramework, I have to use Android IPC, which is painful.
- jack: Should rust-media land in a sandboxed content process?
- pcwalton: Probably in the compositor process, because it needs access to the hardware decoders.