addyosmani/preprocessing.md

## preprocessing.md

      
Problem: How can we preprocess JavaScript (at build time or on the server side) so engines like V8 don't have to spend as much time in Parse? This topic involves generating either bytecode or a bytecode-like abstraction that an engine would need to accept. For folks who don't know, modern web apps typically spend a lot longer parsing and compiling JS than you may think.

Yoav: This can particularly be an issue on mobile. Same files getting parsed all the time for users. Theoretically if we moved the parsing work to the server-side, we would have to worry about it less.
One angle to this problem is we all ship too much JavaScript. That's one perspective. We could also look at preprocessing.
We've been talking about this topic over the last few weeks a bit with V8. There were three main options proposed.


1. Similar to what optimize-js does: identify IIFEs and mark them as such so the browser's and VM's heuristics will catch them and do a better job than today. optimize-js only tackles IIFEs, but theoretically it could be expanded to other VM heuristics.

2. Annotate the JavaScript code at code-block boundaries so parsing is faster because it doesn't have to do as much work. The parser knows which functions to prioritize.

3. Send down some form of serialized syntax tree to the browser. Theoretically the deserialization process should be cheaper than parsing, so we would replace parsing with it.
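
The first option can be illustrated with a small sketch. Both snippets compute the same value; the only difference is the pair of parentheses around the function expression, which is the kind of rewrite optimize-js performs. Whether a given engine treats the parentheses as an eager-parse signal is a heuristic that may differ between engines and versions, and the names `lazyCandidate` and `eagerCandidate` are made up for illustration.

```javascript
// Function expression invoked immediately, without wrapping parentheses.
// An engine may first skip over (preparse) the body and only afterwards
// find out that it is needed right away.
var lazyCandidate = function () {
  return [1, 2, 3].map(function (n) { return n * 2; });
}();

// Same code with the function wrapped in parentheses. A leading "(" is
// the kind of hint an engine heuristic can use to guess that the
// function will be called immediately and parse it fully up front.
var eagerCandidate = (function () {
  return [1, 2, 3].map(function (n) { return n * 2; });
})();

console.log(lazyCandidate, eagerCandidate);
```

The rewrite changes no behavior, which is why it can be applied mechanically at build time.
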


V8: The reason Parse is slow is that we're parsing functions slowly. If you don't have an IIFE but a module containing many functions, the first time you parse, you parse the functions and then the inner functions. We reparse all the functions that are in there. If there are nested functions they need to be reparsed as well. So the preparse step is half of Parse time. If you don't need the code, it's good to not parse it. We don't keep info about inner functions around. On the preparse step we want to keep inner-function info around, so the benefits of 'special' IIFEs go down by quite a lot. You save the preparse time. Theoretically strategies here would cut the time it takes by 4x.
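
The reparsing cost described here can be made concrete with a toy count. This is not V8's parser; it is a simplified model that assumes every function is eventually called, that a lazily skipped function is preparsed each time an enclosing scope is fully parsed, and that no inner-function info is kept between passes. The function names are hypothetical.

```javascript
// Toy model: how many times is a function body scanned?
// depth 1 = function declared at the top level of the script.
// Without inner-function info, a function at `depth` is preparsed once
// for each enclosing scope that gets fully parsed (the script plus
// depth - 1 outer functions), and then fully parsed once itself.
function scansWithoutInnerInfo(depth) {
  const preparses = depth;
  const fullParses = 1;
  return preparses + fullParses;
}

// Assumption: if preparse results for inner functions were kept around,
// each function would be preparsed once and fully parsed once.
function scansWithInnerInfo(_depth) {
  return 2;
}

function totalScans(depths, scansFor) {
  return depths.reduce((sum, depth) => sum + scansFor(depth), 0);
}

// outer -> inner -> innermost
const nestingDepths = [1, 2, 3];
const totalWithout = totalScans(nestingDepths, scansWithoutInnerInfo);
const totalWith = totalScans(nestingDepths, scansWithInnerInfo);
console.log({ totalWithout, totalWith });
```

Under these assumptions the repeated work grows with nesting depth, which is the part that keeping inner-function info would remove.
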
Ben from Facebook: At a high-level there's a tradeoff here. If Chrome was doing this from scratch surely JS wouldn't be the format you take your app in, you'd be doing something you could hand-tune. When can a web browser offer a more optimised input format? 2 cases. WebP and Brotli compression are examples of where we already do this. Question we are asking is: what would it take for Chrome to accept a different input format?
V8/Toon: we need to think about the benefits from it. If we address the current parsing bottlenecks as we know them we think we wouldn't see as many theoretical wins from preparsing.
Ben/FB: will we get to a point where we can send an app that is theoretically as large as a native app down to the web? Alex R: would it be good for users if we were to do this? Ben: there will be codepaths that don't get executed in most bundles yes, but as programmers we are going to want to ship features and code.
Alex: In EM we're seeing lots of cases where end-user benefit is better when there's less code shipped and that means lower parse costs.
Ben: We did some studies on our code: we change about 8% of the actual source code files that go into the FB site. But because we have to package those together, in terms of changes we ship the site 15 times over in a month. That's a diff of 150x in the code we're shipping. If we look at a world where we can just do delta-ships, we'd only need to ship minimal code.
Alex: We're not too far from here. H/2 for granular modules. SW for doing diffs.
Ben: even a fast reparse of a large codebase being shipped down is a lot. Let's say Chrome is bytecode caching and we do an update to one function on Facebook; it would be great if it could keep the bytecode cache.
Yoav: Wouldn't this be addressed by some async explicit API? Ben: I don't think the experience would be visible to the user. I think you still want a native experience. The cost is proportional to your codebase.
Alex: This is what the Polymer team has been good at with PRPL - grabbing granular resources. Different from Facebook - but in the background, using SW, they grab more granular resources that are loaded and cached. Allows granular loading of only what is needed/changed. Can do invalidation for the application based on what is shipped.
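
A rough sketch of the granular invalidation idea: compare the manifest of what is already cached with the manifest of what was just shipped, and fetch only the modules whose content changed. The manifest shape (module path mapped to a content hash) and the function names are assumptions for illustration, not PRPL's or any service worker library's actual API.

```javascript
// Hypothetical manifest shape: { "<module path>": "<content hash>" }.

// Modules that are new or whose hash changed must be fetched again.
function modulesToFetch(cachedManifest, latestManifest) {
  return Object.keys(latestManifest).filter(
    (path) => cachedManifest[path] !== latestManifest[path]
  );
}

// Modules that no longer ship can be evicted from the cache.
function modulesToEvict(cachedManifest, latestManifest) {
  return Object.keys(cachedManifest).filter(
    (path) => !(path in latestManifest)
  );
}

const cachedManifest = { "app/feed.js": "a1", "app/chat.js": "b2", "app/old.js": "c3" };
const latestManifest = { "app/feed.js": "a1", "app/chat.js": "b9", "app/new.js": "d4" };

const toFetch = modulesToFetch(cachedManifest, latestManifest);
const toEvict = modulesToEvict(cachedManifest, latestManifest);
console.log({ toFetch, toEvict });
```

In a real service worker the two lists would drive fetch and cache-delete calls; here they are plain arrays so the logic can be checked in isolation.
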
Ben: my take on this is that getting engineers to break down their dependencies into a granular level becomes very tricky. Tempting to take very small interactions with the code..like think about writing C++ in Chrome. You don't think about including this one file and how many things am I potentially including. As an engineer, we're talking about telling people to optimize away a KB of code. In 2017 a programmer should not care that their class referenced an extra KB of code.
V8/Adam: Let's rewind. You have customers that have a blob of data and you want to just have that loaded but it's probably separate from the ergonomics point. Alex pointing at a slightly separate issue.
V8/Seth: one bit of context. It's easy to look at the current state of the world, where you may have a 2s block where you're parsing JS, and think it's spent tokenizing text into an AST. In reality, a lot of that 2s is inefficiencies which V8 is fixing. Some small part of it is tokenization. Some small part is something a binary format could help with, but a large part is validating the semantics. Validating early errors. Where do scopes fit in. Where do functions start and end. A lot of this a browser has to do for correctness purposes. It would be useful, but we need to be specific about what part of Parse would be helped here.
Yoav: are we talking about runtime correctness? or security? Runtime.
Ben: if early errors are a blocker, is this something that would be reasonable to change the semantics so that if you create a corruptible format..
Adam: Unsure what to do with the thought experiment of what if we didn't have JS. We're treating it as a black box. Toon: if we skipped inner functions, we would need to know about what it is doing. You would need to ship something that V8 understands.
Yoav: Why would you need the function boundaries?
Toon: you need to know what variables are used. We're building this type of file format internally which is very tied to V8.
Yoav: not talking about anything V8 specific...
Sean: we could generate a metadata format..
Ben: hypothetically, if there was a component of Chrome that said, I'm processing this type of file, CSS/JS etc., and I'm specific to this version of Chrome, you have to update it. Imagine you have a part of processing that could be cloudified. Almost a service or a proxy. At least for FB and Akamai we have enough infra serving this JS to billions that it's worth it for us to run. Even if it's for each browser, each version of Chrome.
Toon: you're also going to increase your load time..
Pat: How deep into the pipeline can you go? In Node we could output the optimal binary format.
Toon: I think the wins are theoretical. Alex: what are the scales of the wins? Toon: scanning is like 4%. It's hard to say what it would look like.
Ben: Chrome will always have incremental code caching that could be done in the cloud. If we have a file and we've only changed 1% of that file. If we could save costs by generating V8 bytecode. We update files frequently, but not a lot of this.
Alex: JS creates non-local impact for very small code-changes. V8 can't know that ahead of time. Love to talk how we can do more the first time the new file is loaded so it can be fast. Situation is store in SW cache, we store a lot of optimisation info. Unsure what the line is, but what if we could skip some of that and ran a preprocessing step.
Kenji: we've been exploring async updates with SW for some of these problems. Have to trade ability to update your app frequently.
Ben: Alex question I would ask is if FB could give you updated version of the file...
Alex: Just give us the diff in JS..
Ben: how much do we invest in building magic that keeps us in a world where we're sending text? How do the sites with large infra...Yoav: all sites have this problem. How is current caching being done. It's being done on a file basis.
Kouhei: security is an issue. Can we sign the binary so Chrome can trust. Second thing is now we're investigating [unheard] so we can generate cache metadata for JS files so when we load JS through SW we already have the optimized version.
Sean: maybe useful for library authors where we have an in-between, intermediary fashion. Intermediate not user code but library code.
Jochen: it should be in the browser cache instead of a binary format.
Yoav: if it's cached on a file basis and it's a framework (Angular, Ember) and another copy of the identical version is already there, you'll still fetch it because it came from a different URL. This may be a different problem because it's going to be a URL-addressable cache.
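
The difference between a URL-addressable cache and one keyed by content can be sketched as follows. This is a toy model, not how any browser cache is implemented: the request shape, the example URLs, and the assumption that a content key is known before fetching (for instance because the page declares a digest up front) are all hypothetical. The FNV-1a hash is used only to keep the sketch dependency-free; a real design would need a cryptographic digest.

```javascript
// Tiny non-cryptographic 32-bit FNV-1a hash, for illustration only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Count how many requests miss the cache for a given keying strategy.
function countFetches(requests, keyOf) {
  const cache = new Set();
  let fetches = 0;
  for (const request of requests) {
    const key = keyOf(request);
    if (!cache.has(key)) {
      cache.add(key);
      fetches++;
    }
  }
  return fetches;
}

// The same framework build served from two different (made-up) URLs.
const requests = [
  { url: "https://cdn-a.example/framework.js", body: "/* framework v1 */" },
  { url: "https://cdn-b.example/framework.js", body: "/* framework v1 */" },
  { url: "https://cdn-a.example/framework.js", body: "/* framework v1 */" },
];

const urlKeyedFetches = countFetches(requests, (r) => r.url);
const contentKeyedFetches = countFetches(requests, (r) => fnv1a(r.body));
console.log({ urlKeyedFetches, contentKeyedFetches });
```

With URL keys the identical file is fetched once per distinct URL; with content keys it is fetched once in total.
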
Matt: one thing you skipped was that the binary thing you would serve to the browser from my understanding is that it removes V8 from being able to update in the background. One version of Chrome to the next and if it's a binary you might not get those wins.
Yoav: I don't think it should be something that fundamentally changes.
Addy/Ross: effectively an abstraction on top of Ignition bytecode. Minified JS is already pretty compressed, so you need more metadata and it ends up bigger than the original JS. Deciding on the format is important. Is it cross-browser? That's probably important. Unsure what other browsers need to do here for scopes.
Alex: real risk here that there isn't a collaborative way to create this format. The good-functioning of the ecosystem would be at risk.
Ben: think the risk is low if it's just another delivery mechanism.
Seth: there's a cost of maintaining two versions forever. If someone never updates you'll need to maintain this forever.
Jochen: you have your normal JS + the regular script. This works very similar to accept headers. In JS you would need to explore not working with toScript()
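
The Accept-header comparison could look roughly like this on the server side. No media type for preprocessed JavaScript exists; `application/x-hypothetical-js-binary`, the file names, and `chooseScriptVariant` are invented for this sketch. The `Vary: Accept` response header is the standard way to tell caches that the response depends on the request's Accept header.

```javascript
// Invented media type, used only for this sketch.
const HYPOTHETICAL_BINARY_TYPE = "application/x-hypothetical-js-binary";

// Pick which variant of a script to serve based on the Accept header.
// Quality values (;q=...) are ignored to keep the sketch short.
function chooseScriptVariant(acceptHeader) {
  const accepted = (acceptHeader || "")
    .split(",")
    .map((part) => part.split(";")[0].trim().toLowerCase());

  const useBinary = accepted.includes(HYPOTHETICAL_BINARY_TYPE);
  return {
    file: useBinary ? "app.bin" : "app.js",
    headers: {
      "Content-Type": useBinary ? HYPOTHETICAL_BINARY_TYPE : "text/javascript",
      // Caches must key on Accept, since the body differs per variant.
      "Vary": "Accept",
    },
  };
}

const forNewBrowser = chooseScriptVariant(
  "application/x-hypothetical-js-binary, text/javascript;q=0.9"
);
const forOldBrowser = chooseScriptVariant("*/*");
console.log(forNewBrowser.file, forOldBrowser.file);
```

Browsers that never advertise the new type keep receiving plain JavaScript, which is the fallback property that makes the Accept-header analogy attractive.
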
Yoav: not thinking of this as a proprietary thing.
Seth: worth considering what a binary would do based on technical merit. Brotli is agnostic to what goes over the wire. A binary format for JS would have to be tied to the semantics of JS, which makes it much harder to make it future-proof. As a POR WASM has taken 2.5 years to design a binary format that we don't have to worry about version-wise. There's so much low-hanging fruit in what we can do about caching functions and making ES6 modules faster that I would encourage us to solve this as a general problem rather than focusing on the tokenization problem, in terms of spending time.
Yoav: fair. I think it all depends on how much these improvements will get us. What happens in other engines.
Toon: the parens around your critical functions would give you some wins again.
Sean: if we're at the ceiling shipping smallest code we can ship and we're doing as much as we can for heuristics to parse the most efficiently. We have people doing as much as they can to ship small code but it still isn't quite enough. Using as many tricks.
Toon: if you eagerly compile too much there are also other risks. I wonder once we're done with this work on improving Parse we'll still really get benefit from precompilation.
Jacob: we're wasting time by pulling things in and parsing them multiple times. The parser wants to parse stuff first that will be used first - because..that helps the UX? In your perfect scenario you would parse the first thing you need first..and things would parse in that order. We ended in that situation. What is the improvement at that point?
Toon: Depends on the app. If shipping one module that needs 20 functions.
Yoav: out of time but let's continue to discuss.
Seth: another session later on about V8's parsing work and we can keep the discussion going.